Encoder-Decoder Models Overview
Mitesh M. Khapra
CS7015 (Deep Learning) : Lecture 16

Module 16.1: Introduction to Encoder-Decoder Models
We will start by revisiting the problem of language modeling.

Running example: "I am at home today ⟨stop⟩"

Informally, given t − 1 words we are interested in predicting the t-th word.

More formally, given y_1, y_2, ..., y_{t−1} we want to find the most probable next word.
We are interested in

P(y_t = j | y_1, y_2, ..., y_{t−1})

We model this with a recurrent neural network that maintains a state s_t (for the running example, the inputs are "<GO> I am at home today" and the outputs are "I am at home today ⟨stop⟩"):

P(y_t = j | y_1^{t−1}) = P(y_t = j | s_t) = softmax(V s_t + c)_j

Parameters: U, V, W, b, c

Loss:

L(θ) = Σ_{t=1}^T L_t(θ),   where L_t(θ) = −log P(y_t = ℓ_t | y_1^{t−1})

and ℓ_t is the true word at time step t.

Data: raw text, for example "India, officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, ....."
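To make the computation concrete, here is a minimal sketch of this language model in PyTorch; the vocabulary size, embedding size and hidden size are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions, not from the lecture)
vocab_size, emb_dim, hidden_dim = 10000, 128, 256

embed = nn.Embedding(vocab_size, emb_dim)      # maps word ids to x_t
cell = nn.RNNCell(emb_dim, hidden_dim)         # s_t = sigma(U x_t + W s_{t-1} + b)
V = nn.Linear(hidden_dim, vocab_size)          # softmax(V s_t + c)

def lm_loss(word_ids):
    """word_ids: 1-D tensor [y_1, ..., y_T]; returns L(theta) = sum_t L_t(theta)."""
    s = torch.zeros(1, hidden_dim)             # s_0
    prev = torch.zeros(1, emb_dim)             # stand-in for the <GO> embedding
    loss = 0.0
    for t in range(word_ids.shape[0]):
        s = cell(prev, s)                      # s_t = RNN(s_{t-1}, x_t)
        logits = V(s)                          # V s_t + c
        # L_t(theta) = -log P(y_t = l_t | y_1^{t-1})
        loss = loss + F.cross_entropy(logits, word_ids[t:t+1])
        prev = embed(word_ids[t:t+1])          # next input is the true word l_t
    return loss

print(lm_loss(torch.randint(0, vocab_size, (6,))))
```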
What is the input at each time step? It is simply the word that we predicted at the previous time step (so the generated output "I am at home today <stop>" is fed back in as the input sequence, starting from a <GO> token).

In general,

s_t = RNN(s_{t−1}, x_t)
Before moving on we will see a compact way of writing the function computed by an RNN, GRU and LSTM. We will use these notations going forward.

RNN:   s_t = σ(U x_t + W s_{t−1} + b)

GRU:   s̃_t = σ(W (o_t ⊙ s_{t−1}) + U x_t + b)
       s_t = i_t ⊙ s_{t−1} + (1 − i_t) ⊙ s̃_t

LSTM:  s̃_t = σ(W h_{t−1} + U x_t + b)
       s_t = f_t ⊙ s_{t−1} + i_t ⊙ s̃_t
       h_t = o_t ⊙ σ(s_t)

In all three cases we will compactly write the state update as s_t = RNN(s_{t−1}, x_t).
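In code, this compact notation amounts to swapping one recurrent cell for another behind the same s_t = RNN(s_{t−1}, x_t) interface; a sketch using PyTorch's built-in cells (sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim = 128, 256       # illustrative sizes

x_t = torch.randn(1, emb_dim)        # current input
s_prev = torch.zeros(1, hidden_dim)  # s_{t-1}

# s_t = RNN(s_{t-1}, x_t): the same interface, three different cells
rnn_cell = nn.RNNCell(emb_dim, hidden_dim)
gru_cell = nn.GRUCell(emb_dim, hidden_dim)
lstm_cell = nn.LSTMCell(emb_dim, hidden_dim)

s_t = rnn_cell(x_t, s_prev)                    # vanilla RNN update
s_t = gru_cell(x_t, s_prev)                    # GRU update (gates handled internally)
h_t, s_t = lstm_cell(x_t, (s_prev, s_prev))    # LSTM keeps both h_t and the cell state s_t

print(s_t.shape)                               # torch.Size([1, 256]) in every case
```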
So far we have seen how to model the conditional probability distribution

P(y_t = j | y_1^{t−1})

Now suppose we want to condition the generated sentence on an image I, for example to produce the caption "A man throwing a frisbee in a park ⟨stop⟩". In other words, we are now interested in

P(y_t = j | y_1^{t−1}, I)

Earlier we modeled P(y_t | y_1^{t−1}) using the RNN's state s_t. There are many ways of making P(y_t = j) conditional on fc7(I), the representation of the image taken from the fc7 layer of a CNN. Let us see two such options.
Option 1: Set s_0 = fc7(I).

Now s_0, and hence all subsequent s_t's, depend on fc7(I). We can thus say that P(y_t = j) depends on fc7(I). In other words, we are computing

P(y_t = j | s_t, fc7(I))
Option 2: Another, more explicit, way of doing this is to feed fc7(I) to the decoder at every time step, i.e.

s_t = RNN(s_{t−1}, [x_t ; fc7(I)])

so that we again compute P(y_t = j | y_1^{t−1}, I).
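A small sketch contrasting the two options; the projection used in Option 1 and all dimensions are assumptions for illustration, with `fc7` standing in for a 4096-dimensional CNN feature vector:

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim, fc7_dim = 128, 256, 4096   # illustrative sizes
fc7 = torch.randn(1, fc7_dim)                   # fc7(I), assumed precomputed by a CNN

# Option 1: use the image only to initialise the decoder state, s_0 = fc7(I)
W_img = nn.Linear(fc7_dim, hidden_dim)          # project fc7(I) to the state size
cell1 = nn.RNNCell(emb_dim, hidden_dim)
s = W_img(fc7)                                  # s_0 depends on fc7(I) ...
x_t = torch.randn(1, emb_dim)
s = cell1(x_t, s)                               # ... and hence so do all later s_t

# Option 2: feed fc7(I) to the decoder at every time step
cell2 = nn.RNNCell(emb_dim + fc7_dim, hidden_dim)
s = torch.zeros(1, hidden_dim)
s = cell2(torch.cat([x_t, fc7], dim=1), s)      # s_t = RNN(s_{t-1}, [x_t; fc7(I)])
```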
Let us look at the full architecture.

A CNN is first used to encode the image. An RNN is then used to decode (generate) a sentence from the encoding.

This is a typical encoder-decoder architecture: both the encoder and the decoder use a neural network.

Alternatively, the encoder's output can be fed to every step of the decoder (Option 2) instead of only initializing s_0 (Option 1).
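Putting the pieces together, a sketch of this encoder-decoder at inference time: the CNN output initializes the decoder state (Option 1), and the RNN decoder feeds its own prediction back in until it emits the stop token. Greedy decoding, the module names and all sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, fc7_dim = 10000, 128, 256, 4096  # illustrative
GO, STOP = 0, 1                                                   # assumed special token ids

embed = nn.Embedding(vocab_size, emb_dim)
encoder_proj = nn.Linear(fc7_dim, hidden_dim)   # maps fc7(I) to s_0 (Option 1)
decoder = nn.RNNCell(emb_dim, hidden_dim)
V = nn.Linear(hidden_dim, vocab_size)

def generate_caption(fc7, max_len=20):
    """Greedy decoding: y_hat_t = argmax_j P(y_t = j | y_1^{t-1}, I)."""
    s = encoder_proj(fc7)                        # s_0 derived from fc7(I)
    y_prev = torch.tensor([GO])
    caption = []
    for _ in range(max_len):
        s = decoder(embed(y_prev), s)            # s_t = RNN(s_{t-1}, e(y_hat_{t-1}))
        probs = torch.softmax(V(s), dim=-1)      # P(y_t | y_1^{t-1}, I)
        y_prev = probs.argmax(dim=-1)            # feed the prediction back in
        if y_prev.item() == STOP:
            break
        caption.append(y_prev.item())
    return caption

print(generate_caption(torch.randn(1, fc7_dim)))
```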
Module 16.2: Applications of Encoder-Decoder Models
For all these applications we will try to answer the following questions:

What kind of a network can we use to encode the input(s)? (What is an appropriate encoder?)
What kind of a network can we use to decode the output? (What is an appropriate decoder?)
What are the parameters of the model?
What is an appropriate loss function?
Task: Image captioning

Data: {x_i = image_i, y_i = caption_i}_{i=1}^N

Model:
Encoder: s_0 = CNN(x_i)
Decoder: s_t = RNN(s_{t−1}, e(ŷ_{t−1}))   (e(ŷ_{t−1}) is the embedding of the previously predicted word)
         P(y_t = j | y_1^{t−1}, fc7(I)) = softmax(V s_t + b)_j

Parameters: U_dec, V, W_dec, W_conv, b

Loss:

L(θ) = Σ_{t=1}^T L_t(θ) = −Σ_{t=1}^T log P(y_t = ℓ_t | y_1^{t−1}, fc7(I))

Algorithm: Gradient descent with backpropagation
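Reusing the modules from the captioning sketch above (an assumption to keep this short), the loss for one (image, caption) pair could be computed as below; for simplicity the ground-truth previous word is fed to the decoder (teacher forcing), which is a common training-time choice and an assumption here, since the slides write the decoder input as e(ŷ_{t−1}):

```python
import torch
import torch.nn.functional as F

def caption_loss(fc7, caption_ids):
    """caption_ids: 1-D tensor of gold word ids l_1, ..., l_T (ending in <stop>)."""
    s = encoder_proj(fc7)                        # Encoder: s_0 = CNN(x_i)
    y_prev = torch.tensor([GO])
    loss = 0.0
    for t in range(caption_ids.shape[0]):
        s = decoder(embed(y_prev), s)            # Decoder: s_t = RNN(s_{t-1}, e(y_{t-1}))
        logits = V(s)
        # L_t(theta) = -log P(y_t = l_t | y_1^{t-1}, I)
        loss = loss + F.cross_entropy(logits, caption_ids[t:t+1])
        y_prev = caption_ids[t:t+1]              # teacher forcing (assumption)
    return loss

loss = caption_loss(torch.randn(1, fc7_dim), torch.randint(0, vocab_size, (8,)))
loss.backward()                                  # gradients for gradient descent
```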
Task: Machine translation

Example: i/p: "It is raining outside", o/p: "Mein ghar ja raha hoon"

Data: {x_i = source_i, y_i = target_i}_{i=1}^N

Model:
Encoder: h_t = RNN(h_{t−1}, x_{it}),  s_0 = h_T
Decoder: s_t = RNN(s_{t−1}, e(ŷ_{t−1}))
         P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)

Parameters: U_dec, V, W_dec, U_enc, W_enc, b

Loss:

L(θ) = Σ_{t=1}^T L_t(θ) = −Σ_{t=1}^T log P(y_t = ℓ_t | y_1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Task: Question answering over an image (e.g. "What is the bird's color?")

Data: {x_i = (image_i, question_i), y_i = answer_i}_{i=1}^N

Model:
Encoder: ĥ_I = CNN(I),  h̃_t = RNN(h̃_{t−1}, q_{it}),  s = [h̃_T ; ĥ_I]
Decoder: P(y | q, I) = softmax(V s + b)

Parameters: V, b, U_q, W_q, W_conv

Loss: L(θ) = −log P(y = ℓ | I, q)

Task: Document summarization

Example: i/p: "India beats Srilanka to win ICC WC 2011. Dhoni and Gambhir's half centuries help beat SL", o/p: "India won the world cup"

Data: {x_i = document_i, y_i = summary_i}_{i=1}^N

Model, parameters, loss and algorithm: same as for machine translation above; the encoder reads the document word by word (h_t = RNN(h_{t−1}, x_{it}), s_0 = h_T) and the decoder generates the summary.
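A sketch of this sequence-to-sequence model; vocabulary sizes, dimensions and the use of random token ids are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

src_vocab, tgt_vocab, emb_dim, hidden_dim = 8000, 8000, 128, 256  # illustrative
GO = 0                                                            # assumed <Go> id

src_embed = nn.Embedding(src_vocab, emb_dim)
tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
enc_cell = nn.RNNCell(emb_dim, hidden_dim)     # U_enc, W_enc
dec_cell = nn.RNNCell(emb_dim, hidden_dim)     # U_dec, W_dec
V = nn.Linear(hidden_dim, tgt_vocab)           # V, b

def translate_loss(src_ids, tgt_ids):
    # Encoder: h_t = RNN(h_{t-1}, x_it)
    h = torch.zeros(1, hidden_dim)
    for t in range(src_ids.shape[0]):
        h = enc_cell(src_embed(src_ids[t:t+1]), h)
    # Decoder: s_0 = h_T, s_t = RNN(s_{t-1}, e(y_{t-1})), softmax(V s_t + b)
    s, y_prev, loss = h, torch.tensor([GO]), 0.0
    for t in range(tgt_ids.shape[0]):
        s = dec_cell(tgt_embed(y_prev), s)
        loss = loss + F.cross_entropy(V(s), tgt_ids[t:t+1])
        y_prev = tgt_ids[t:t+1]                # ground-truth feedback (assumption)
    return loss

src = torch.randint(0, src_vocab, (5,))        # source word ids
tgt = torch.randint(0, tgt_vocab, (6,))        # target word ids, ending in <stop>
print(translate_loss(src, tgt))
```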
Task: Video captioning

Example: o/p: "A man walking on a rope"

Data: {x_i = video_i, y_i = description_i}_{i=1}^N

Model:
Encoder: h_t = RNN(h_{t−1}, CNN(x_{it}))   (a CNN is applied to each frame)
Decoder: s_0 = h_T
         s_t = RNN(s_{t−1}, e(ŷ_{t−1}))
         P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)

The closely related task of video classification uses the same encoder but predicts a single label for the whole video:

Model:
Encoder: h_t = RNN(h_{t−1}, CNN(x_{it}))
Decoder: s = h_T
         P(y | video) = softmax(V s + b)

Parameters: V, b, W_conv, U_enc, W_enc

Loss: L(θ) = −log P(y = ℓ | video)

Algorithm: Gradient descent with backpropagation
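A sketch of the video encoder: a CNN feature vector per frame is consumed by an RNN. The feature dimension and the random tensors standing in for real CNN outputs are assumptions:

```python
import torch
import torch.nn as nn

feat_dim, hidden_dim = 512, 256          # illustrative sizes
enc_cell = nn.RNNCell(feat_dim, hidden_dim)

def encode_video(frame_features):
    """frame_features: tensor of shape (num_frames, feat_dim), i.e. CNN(x_it) per frame."""
    h = torch.zeros(1, hidden_dim)
    for t in range(frame_features.shape[0]):
        # h_t = RNN(h_{t-1}, CNN(x_it))
        h = enc_cell(frame_features[t:t+1], h)
    return h                              # h_T: used as s_0 (captioning) or s (classification)

video = torch.randn(30, feat_dim)         # stand-in for per-frame CNN features
s0 = encode_video(video)
print(s0.shape)                           # torch.Size([1, 256])
```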
Task: Dialog

Example response: "I am fine <STOP>"

Data: {x_i = utterance_i, y_i = response_i}_{i=1}^N

Model:
Encoder: h_t = RNN(h_{t−1}, x_{it})
Decoder: s_0 = h_T   (T is the length of the input)
         s_t = RNN(s_{t−1}, e(ŷ_{t−1}))
         P(y_t | y_1^{t−1}, x) = softmax(V s_t + b)
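Since the dialog model has the same shape as the translation model, a sketch of response generation only needs a greedy decoding loop on top of the utterance encoder (greedy decoding, special token ids and all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab, emb_dim, hidden_dim = 8000, 128, 256          # illustrative sizes
GO, STOP = 0, 1                                      # assumed special token ids

embed = nn.Embedding(vocab, emb_dim)
enc_cell = nn.RNNCell(emb_dim, hidden_dim)
dec_cell = nn.RNNCell(emb_dim, hidden_dim)
V = nn.Linear(hidden_dim, vocab)

def respond(utterance_ids, max_len=20):
    h = torch.zeros(1, hidden_dim)
    for t in range(utterance_ids.shape[0]):          # Encoder: h_t = RNN(h_{t-1}, x_it)
        h = enc_cell(embed(utterance_ids[t:t+1]), h)
    s, y_prev, out = h, torch.tensor([GO]), []       # Decoder: s_0 = h_T
    for _ in range(max_len):
        s = dec_cell(embed(y_prev), s)               # s_t = RNN(s_{t-1}, e(y_hat_{t-1}))
        y_prev = torch.softmax(V(s), dim=-1).argmax(dim=-1)
        if y_prev.item() == STOP:
            break
        out.append(y_prev.item())
    return out                                       # word ids of the response

print(respond(torch.randint(0, vocab, (4,))))        # toy utterance ids
```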
Module 16.3: Attention Mechanism
Let us motivate the need for attention with the help of machine translation (i/p: "Main ghar ja raha hoon", o/p: "I am going home <STOP>").

The encoder reads the sentence only once and encodes it. At each time step the decoder uses this embedding to produce a new word.

Is this how humans translate a sentence? Not really!
Let us revisit the decoder that we have seen so far. We either feed in the encoder information only once (at s_0), or we feed the same encoder information at each time step.

Now suppose an oracle told you which words to focus on at a given time step t. Can you think of a smarter way of feeding this information to the decoder?
Task: Machine translation with attention

Example: i/p: "Main ghar ja raha hoon", o/p: "I am going home <STOP>"

Data: {x_i = source_i, y_i = target_i}_{i=1}^N

Encoder: h_t = RNN(h_{t−1}, x_t),  s_0 = h_T

Decoder:

e_{jt} = V_attn^T tanh(U_attn h_j + W_attn s_t)
α_{jt} = softmax(e_{jt})    (normalized over the input positions j)
c_t = Σ_{j=1}^T α_{jt} h_j
s_t = RNN(s_{t−1}, [e(ŷ_{t−1}), c_t])
ℓ_t = softmax(V s_t + b)

Parameters: U_dec, V, W_dec, U_enc, W_enc, b, U_attn, V_attn, W_attn

The loss and the training algorithm remain the same as before.
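A sketch of one decoder step with attention. One implementation detail is an assumption: the scores here are computed against the previous state s_{t−1}, since s_t is only produced after c_t has been formed, whereas the slide writes the score in terms of s_t:

```python
import torch
import torch.nn as nn

hidden_dim, emb_dim, attn_dim, vocab = 256, 128, 128, 8000   # illustrative sizes

U_attn = nn.Linear(hidden_dim, attn_dim, bias=False)
W_attn = nn.Linear(hidden_dim, attn_dim, bias=False)
V_attn = nn.Linear(attn_dim, 1, bias=False)
dec_cell = nn.RNNCell(emb_dim + hidden_dim, hidden_dim)      # input is [e(y_{t-1}); c_t]
V_out = nn.Linear(hidden_dim, vocab)

def attention_step(H, s_prev, y_prev_emb):
    """H: encoder states h_1..h_T of shape (T, hidden_dim); one decoder step."""
    # e_jt = V_attn^T tanh(U_attn h_j + W_attn s), scored here against s_{t-1}
    scores = V_attn(torch.tanh(U_attn(H) + W_attn(s_prev))).squeeze(-1)   # (T,)
    alpha = torch.softmax(scores, dim=0)                                  # alpha_jt
    c_t = (alpha.unsqueeze(-1) * H).sum(dim=0, keepdim=True)              # c_t = sum_j alpha_jt h_j
    s_t = dec_cell(torch.cat([y_prev_emb, c_t], dim=1), s_prev)           # s_t = RNN(s_{t-1}, [e(y_{t-1}), c_t])
    probs = torch.softmax(V_out(s_t), dim=-1)                             # softmax(V s_t + b)
    return s_t, probs, alpha

H = torch.randn(5, hidden_dim)              # h_1..h_5, one per input word
s, p, alpha = attention_step(H, torch.zeros(1, hidden_dim), torch.randn(1, emb_dim))
print(alpha)                                # attention weights over the 5 input words
```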
You can try adding an attention component to all the other encoder-decoder models that we discussed earlier and answer the same set of questions (data, encoder, decoder, loss, training algorithm).
Can we check if the attention model actually learns something meaningful? In other words, does it really learn to focus on the most relevant words in the input at the t-th time step?

We can check this by plotting the attention weights as a heat map (we will see some examples below).
Figure: Example outputs of an attention-based summarization system [Rush et al. 2015] and of an attention-based neural machine translation model [Cho et al. 2015].

The heat map shows a soft alignment between the input and the generated output. Each cell in the heat map corresponds to α_{tj} (i.e., the importance of the j-th input word for predicting the t-th output word, as determined by the model).
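A minimal sketch of such a heat map with matplotlib; the words are just the running example and the random weights stand in for the model's learned α_{tj} matrix:

```python
import numpy as np
import matplotlib.pyplot as plt

src = ["Main", "ghar", "ja", "raha", "hoon"]            # input words (j axis)
tgt = ["I", "am", "going", "home", "<stop>"]            # output words (t axis)

# alpha[t, j]: importance of the j-th input word for the t-th output word.
# Random weights used as a stand-in for the model's attention matrix.
alpha = np.random.dirichlet(np.ones(len(src)), size=len(tgt))

plt.imshow(alpha, cmap="gray")
plt.xticks(range(len(src)), src, rotation=45)
plt.yticks(range(len(tgt)), tgt)
plt.xlabel("input words")
plt.ylabel("output words")
plt.colorbar()
plt.show()
```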
Figure: Example output of an attention-based video captioning system [Yao et al. 2015].
Module 16.4: Attention over images
How do we model an attention mechanism for images, for example while generating the caption "A man throwing a frisbee in a park"?
In the case of text we have a representation h_i for every location (time step) of the input sequence, so the decoder can compute attention weights α_i over these locations and combine the h_i's into a context vector c_t.
Well, instead of the fc7 representation we use the output of one of the convolutional layers, which has spatial information. For example, the output of the 5th convolutional layer of VGGNet is a 14 × 14 × 512 feature map.

Figure: The VGGNet architecture: successive convolution + maxpool stages take the 224 × 224 input down through spatial sizes 112, 56, 28 and 14 (with 64, 128, 256 and 512 channels) to a 14 × 14 × 512 (and finally 7 × 7 × 512) feature map, followed by fully connected layers of size 4096 and 4096 and a 1000-way softmax.
We could think of this 14 × 14 × 512 feature map as 196 locations, each having a 512-dimensional representation.

The model will then learn an attention (α_{t1}, ..., α_{t196}) over these locations, which in turn correspond to actual locations in the image.
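A sketch of attention over the 196 locations: the 14 × 14 × 512 feature map is reshaped into 196 location vectors and the same scoring machinery as before is applied (the random feature map stands in for a real VGGNet conv5 output; all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

loc_dim, hidden_dim, attn_dim = 512, 256, 128        # illustrative sizes

U_attn = nn.Linear(loc_dim, attn_dim, bias=False)
W_attn = nn.Linear(hidden_dim, attn_dim, bias=False)
V_attn = nn.Linear(attn_dim, 1, bias=False)

conv5 = torch.randn(14, 14, 512)                     # stand-in for VGGNet conv5 features
locations = conv5.reshape(196, loc_dim)              # 196 locations, 512-d each

s_prev = torch.zeros(1, hidden_dim)                  # current decoder state
scores = V_attn(torch.tanh(U_attn(locations) + W_attn(s_prev))).squeeze(-1)
alpha = torch.softmax(scores, dim=0)                 # alpha_t1 .. alpha_t196
c_t = (alpha.unsqueeze(-1) * locations).sum(dim=0)   # attended image summary for step t

print(alpha.reshape(14, 14).shape)                   # weights map back onto the image grid
```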
Let us look at some examples of attention over images for the task of image captioning.
Figure: Examples of the attention-based model attending to the correct object (white indicates the attended regions, underlines indicate the corresponding word) [Kyunghyun Cho et al. 2015].
Module 16.5: Hierarchical Attention
Consider a dialog between a user (U) and a bot (B):

Context
U: Can you suggest a good movie?
B: Yes, sure. How about Logan?
U: Okay, who is the lead actor?

Response
B: Hugh Jackman, of course

The dialog contains a sequence of utterances between the user and the bot, and each utterance in turn is a sequence of words. Thus what we have here is a "sequence of sequences" as input. Can you think of an encoder for such a sequence of sequences?
(Target output: "Hugh Jackman of course <STOP>")

We could think of a two-level hierarchical RNN encoder.

The first-level RNN operates on the sequence of words in each utterance and gives us a representation of that utterance. We now have a sequence of utterance representations.

A second RNN then encodes this sequence of utterance representations and gives a single representation for the whole sequence of utterances.

The decoder can then produce an output sequence (the response) conditioned on this representation.
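A sketch of such a two-level encoder: a word-level RNN produces one vector per utterance, and an utterance-level RNN encodes that sequence of vectors. Sizes and the toy token ids are assumptions:

```python
import torch
import torch.nn as nn

vocab, emb_dim, hidden_dim = 8000, 128, 256          # illustrative sizes
embed = nn.Embedding(vocab, emb_dim)
word_rnn = nn.RNNCell(emb_dim, hidden_dim)           # level 1: over words in an utterance
utt_rnn = nn.RNNCell(hidden_dim, hidden_dim)         # level 2: over utterance vectors

def encode_utterance(word_ids):
    h = torch.zeros(1, hidden_dim)
    for t in range(word_ids.shape[0]):
        h = word_rnn(embed(word_ids[t:t+1]), h)
    return h                                          # representation of one utterance

def encode_dialog(utterances):
    g = torch.zeros(1, hidden_dim)
    for u in utterances:                              # sequence of utterance representations
        g = utt_rnn(encode_utterance(u), g)
    return g                                          # conditioning vector for the decoder

context = [torch.randint(0, vocab, (n,)) for n in (6, 7, 7)]   # 3 toy utterances
print(encode_dialog(context).shape)                  # torch.Size([1, 256])
```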
Let us look at another example. Consider the task of document classification or summarization. A document is a sequence of sentences (for example: "Politics is the process of making decisions applying to all members of each group. More narrowly, it refers to achieving and ..." → class: Politics).
Data: {Document_i, class_i}_{i=1}^N

Word-level (1) encoder:

h_{ij}^1 = RNN(h_{i,j−1}^1, w_{ij})
s_i = h_{i,T_i}^1    (T_i is the length of sentence i)

Sentence-level (2) encoder: a second RNN runs over the sentence representations s_1, s_2, ... and its final state gives the document representation s.

Decoder:

P(y | document) = softmax(V s + b)
How would you model attention in such a hierarchical encoder-decoder model?

We need attention at two levels: first we need to attend to the important (most informative) words in a sentence, and then we need to attend to the important (most informative) sentences in a document. Let us see how to model this.
Data: {Document_i, class_i}_{i=1}^N

Word-level (1) encoder with attention:

h_{ij} = RNN(h_{i,j−1}, w_{ij})
u_{ij} = tanh(W_w h_{ij} + b_w)
α_{ij} = exp(u_{ij}^T u_w) / Σ_t exp(u_{it}^T u_w)
s_i = Σ_j α_{ij} h_{ij}

The same attention mechanism, with parameters W_s, b_s and u_s, is then applied over the sentence representations s_i to obtain the document representation s.

Decoder:

P(y | document) = softmax(V s + b)

Parameters: W_w, W_s, V, b_w, b_s, b, u_w, u_s

Loss: cross entropy

Algorithm: Gradient descent with backpropagation
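A sketch of the word-level attention that replaces the plain last-hidden-state summary; the same construction, with parameters W_s, b_s, u_s, would then be applied over the sentence vectors s_i to obtain the document vector s (sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

hidden_dim, attn_dim = 256, 128                       # illustrative sizes
W_w = nn.Linear(hidden_dim, attn_dim)                 # W_w, b_w
u_w = nn.Parameter(torch.randn(attn_dim))             # word-level context vector u_w

def attend_words(H_i):
    """H_i: word-level states h_i1..h_iT of one sentence, shape (T, hidden_dim)."""
    u = torch.tanh(W_w(H_i))                          # u_ij = tanh(W_w h_ij + b_w)
    scores = u @ u_w                                  # u_ij^T u_w, shape (T,)
    alpha = torch.softmax(scores, dim=0)              # alpha_ij
    return (alpha.unsqueeze(-1) * H_i).sum(dim=0)     # s_i = sum_j alpha_ij h_ij

sentence_states = torch.randn(12, hidden_dim)         # stand-in for RNN outputs of one sentence
print(attend_words(sentence_states).shape)            # torch.Size([256])
```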