Encoder-Decoder Models Overview

The document discusses Encoder-Decoder models and the Attention Mechanism in deep learning, focusing on language modeling and image captioning. It explains how to model conditional probabilities using Recurrent Neural Networks (RNNs) and introduces the architecture for generating sentences from images. Additionally, it outlines applications of these models, including tasks like image captioning and textual entailment, while emphasizing the importance of encoder and decoder networks.

CS7015 (Deep Learning) : Lecture 16

Encoder Decoder Models, Attention Mechanism

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Module 16.1: Introduction to Encoder Decoder Models

We will start by revisiting the problem of language modeling.

[Figure: an RNN unrolled over the input "<GO> I am at home today", producing the output "I am at home today <stop>"; at each time step the state st is computed from st−1 and the input xt, and the output distribution is computed from st.]

Informally, given 't − 1' words we are interested in predicting the t-th word.

More formally, given y1, y2, ..., yt−1 we want to find

y* = argmax P(yt | y1, y2, ..., yt−1)

Let us see how we model P(yt | y1, y2, ..., yt−1) using an RNN.

We will refer to P(yt | y1, y2, ..., yt−1) by the shorthand notation P(yt | y1^{t−1}).
We are interested in P(yt = j | y1, y2, ..., yt−1), where j ∈ V and V is the set of all vocabulary words.

Using an RNN we compute this as

P(yt = j | y1^{t−1}) = softmax(V st + c)_j

In other words we compute

P(yt = j | y1^{t−1}) = P(yt = j | st) = softmax(V st + c)_j

Notice that the recurrent connections ensure that st has information about y1^{t−1}.
Data: all sentences from any large corpus (say Wikipedia), e.g., "India, officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, ..."

Model:

st = σ(W st−1 + U xt + b)
P(yt = j | y1^{t−1}) = softmax(V st + c)_j

Parameters: U, V, W, b, c

Loss:

L(θ) = Σ_{t=1}^{T} Lt(θ),   Lt(θ) = − log P(yt = ℓt | y1^{t−1})

where ℓt is the true word at time step t.
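As a concrete illustration, here is a minimal NumPy sketch of this language model: one recurrent step and the summed cross-entropy loss. The sizes, the random initialization, and the use of tanh for σ are assumptions made for the sketch, not the lecture's reference implementation; the later sketches in this document reuse these names.

```python
import numpy as np

V_SIZE, EMB, HID = 10000, 64, 128           # assumed vocabulary, input and state sizes
rng = np.random.default_rng(0)
U = rng.normal(0, 0.01, (HID, EMB))         # input-to-state
W = rng.normal(0, 0.01, (HID, HID))         # state-to-state
b = np.zeros(HID)
V = rng.normal(0, 0.01, (V_SIZE, HID))      # state-to-vocabulary
c = np.zeros(V_SIZE)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(s_prev, x_t):
    """s_t = sigma(W s_{t-1} + U x_t + b); sigma taken as tanh here."""
    s_t = np.tanh(W @ s_prev + U @ x_t + b)
    p_t = softmax(V @ s_t + c)              # P(y_t = j | y_1^{t-1})
    return s_t, p_t

def sequence_loss(xs, targets, s0):
    """L(theta) = sum_t -log P(y_t = l_t | y_1^{t-1}), l_t = true word at step t."""
    s, loss = s0, 0.0
    for x_t, l_t in zip(xs, targets):
        s, p = step(s, x_t)
        loss += -np.log(p[l_t] + 1e-12)
    return loss
```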
What is the input at each time step?

It is simply the word that we predicted at the previous time step.

In general, st = RNN(st−1, xt).

Let j be the index of the word which has been assigned the maximum probability at time step t − 1. Then

xt = e(vj)

xt is essentially a one-hot vector e(vj) representing the j-th word in the vocabulary.

In practice, instead of the one-hot representation we use a pre-trained word embedding of the j-th word.

[Figure: the unrolled RNN with one-hot input vectors; the first input is <GO>, and each subsequent input is the previously predicted word.]
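A hedged sketch of generation with this scheme, reusing `step` and the parameters from the sketch above: at each step the embedding of the previously predicted word is fed back in. The embedding matrix `E` and the special token ids are assumptions.

```python
E = rng.normal(0, 0.01, (V_SIZE, EMB))      # e(v_j) lookup table (could be pre-trained)
GO, STOP = 1, 2                             # assumed ids of the <GO> and <stop> tokens

def generate(s0, max_len=20):
    """Greedy decoding: x_t = e(v_j) where j got the max probability at step t-1."""
    s, j, out = s0, GO, []
    for _ in range(max_len):
        s, p = step(s, E[j])                # feed back the previous prediction
        j = int(np.argmax(p))               # most probable word at this step
        if j == STOP:
            break
        out.append(j)
    return out
```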
Notice that s0 is not computed but just randomly initialized.

We learn it along with the other parameters of the RNN (or LSTM or GRU).

We will return to this later.
Before moving on we will see a compact way of writing the function computed by an RNN, GRU and LSTM. We will use these notations going forward.

RNN:   st = σ(U xt + W st−1 + b)
       st = RNN(st−1, xt)

GRU:   s̃t = σ(W(ot ⊙ st−1) + U xt + b)
       st = it ⊙ st−1 + (1 − it) ⊙ s̃t
       st = GRU(st−1, xt)

LSTM:  s̃t = σ(W ht−1 + U xt + b)
       st = ft ⊙ st−1 + it ⊙ s̃t
       ht = ot ⊙ σ(st)
       ht, st = LSTM(ht−1, st−1, xt)
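The compact notation maps directly onto code: each cell is a function from the previous state(s) and the current input to the new state(s). Below is a sketch for the RNN and GRU cells under the same assumptions as before (the lecture's σ is taken as sigmoid for the gates and tanh for the candidate state); an LSTM cell would follow the same pattern but maintain and return both ht and st.

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# assumed gate parameters for the GRU sketch (one gate o_t, one interpolation gate i_t)
Wo, Uo, bo = rng.normal(0, 0.01, (HID, HID)), rng.normal(0, 0.01, (HID, EMB)), np.zeros(HID)
Wi, Ui, bi = rng.normal(0, 0.01, (HID, HID)), rng.normal(0, 0.01, (HID, EMB)), np.zeros(HID)

def RNN(s_prev, x_t):
    """s_t = RNN(s_{t-1}, x_t)"""
    return np.tanh(W @ s_prev + U @ x_t + b)

def GRU(s_prev, x_t):
    """s_t = GRU(s_{t-1}, x_t)"""
    o_t = sigmoid(Wo @ s_prev + Uo @ x_t + bo)           # gate applied to the old state
    i_t = sigmoid(Wi @ s_prev + Ui @ x_t + bi)           # interpolation gate
    s_tilde = np.tanh(W @ (o_t * s_prev) + U @ x_t + b)  # candidate state
    return i_t * s_prev + (1 - i_t) * s_tilde
```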
So far we have seen how to model the conditional probability distribution P(yt | y1^{t−1}).

More informally, we have seen how to generate a sentence given the previous words.

What if we want to generate a sentence given an image?

[Figure: an image of a man throwing a frisbee in a park, with an RNN generating the caption "A man throwing a frisbee in a park".]

We are now interested in P(yt | y1^{t−1}, I) instead of P(yt | y1^{t−1}), where I is an image.

Notice that P(yt | y1^{t−1}, I) is again a conditional distribution.
Earlier we modeled P(yt | y1^{t−1}) as

P(yt = j | y1^{t−1}) = P(yt = j | st)

where st was a state capturing all the previous words.

We could now model P(yt = j | y1^{t−1}, I) as P(yt = j | st, fc7(I)), where fc7(I) is the representation obtained from the fc7 layer of a CNN applied to the image.
There are many ways of making P(yt = j) conditional on fc7(I).

Let us see two such options.
Option 1: Set s0 = fc7(I).

Now s0, and hence all subsequent st's, depend on fc7(I).

We can thus say that P(yt = j) depends on fc7(I).

In other words, we are computing P(yt = j | st, fc7(I)).

[Figure: Option 1 — the CNN encoding of the image is used only to initialize the decoder state s0.]
Option 2: Another, more explicit, way of doing this is to compute

st = RNN(st−1, [xt, fc7(I)])

In other words, we are explicitly using fc7(I) to compute st and hence P(yt = j).

You could think of other ways of conditioning P(yt = j) on fc7(I).

[Figure: Option 2 — the CNN encoding of the image is concatenated with the input at every decoder time step.]
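A sketch of the two options, reusing the decoder parameters from the earlier sketches. The projection of fc7(I) to the state size and the assumed 4096-dimensional fc7 vector are illustrative choices, not part of the lecture.

```python
IMG = 4096                                    # assumed fc7 size
W_img = rng.normal(0, 0.01, (HID, IMG))       # assumed projection of fc7(I) to the state size
U_cat = rng.normal(0, 0.01, (HID, EMB + HID)) # input-to-state when concatenating [x_t, fc7(I)]

def decode_option1(fc7, xs):
    """Option 1: the image enters only through s_0."""
    s, states = np.tanh(W_img @ fc7), []
    for x_t in xs:
        s, _ = step(s, x_t)
        states.append(s)
    return states

def decode_option2(fc7, xs, s0):
    """Option 2: s_t = RNN(s_{t-1}, [x_t, fc7(I)]); the image is fed at every step."""
    img = np.tanh(W_img @ fc7)
    s, states = s0, []
    for x_t in xs:
        s = np.tanh(W @ s + U_cat @ np.concatenate([x_t, img]) + b)
        states.append(s)
    return states
```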
Let us look at the full architecture.

A CNN is first used to encode the image.

An RNN is then used to decode (generate) a sentence from the encoding.

This is a typical encoder-decoder architecture.

Both the encoder and the decoder use a neural network.

[Figure: the CNN encoder produces h0 from the image; the RNN decoder generates "A man throwing ... <stop>".]
Alternatively, the encoder's output can be fed to every step of the decoder.
Module 16.2: Applications of Encoder Decoder models

For all these applications we will try to answer the following questions:

What kind of a network can we use to encode the input(s)? (What is an appropriate encoder?)

What kind of a network can we use to decode the output? (What is an appropriate decoder?)

What are the parameters of the model?

What is an appropriate loss function?
Task: Image captioning

Data: {xi = imagei, yi = captioni}, i = 1, ..., N

Model:
Encoder: s0 = CNN(xi)
Decoder: st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, I) = softmax(V st + b)

Parameters: Udec, V, Wdec, Wconv, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, I)

Algorithm: Gradient descent with backpropagation

[Figure: the CNN encodes the image; the RNN decoder generates the caption "A man throwing ... <stop>".]
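Putting the pieces together for captioning, here is a sketch of the per-example loss with teacher forcing (the true previous word is fed in during training); it reuses the names from the earlier sketches and is only meant to show how the equations combine.

```python
def caption_loss(fc7, caption_ids):
    """-sum_t log P(y_t = l_t | y_1^{t-1}, I) with s_0 derived from the image."""
    s, loss, prev = np.tanh(W_img @ fc7), 0.0, GO        # s_0 = CNN(x_i) (projected)
    for l_t in caption_ids:                              # l_t: true caption word at step t
        s, p = step(s, E[prev])                          # s_t = RNN(s_{t-1}, e(y_{t-1}))
        loss += -np.log(p[l_t] + 1e-12)
        prev = l_t                                       # teacher forcing
    return loss
```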
Task: Textual entailment

Data: {xi = premisei, yi = hypothesisi}, i = 1, ..., N

Model (Option 1):
Encoder: ht = RNN(ht−1, xit)
Decoder: s0 = hT (T is the length of the input)
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — i/p: "It is raining outside", o/p: "The ground is wet <stop>"
Task: Textual entailment — Model (Option 2)

Same as Option 1, except that the decoder is fed the encoder's final state at every time step:

st = RNN(st−1, [hT, e(ŷt−1)])

The data, the remaining equations, the parameters, the loss and the training algorithm are unchanged.
Task: Machine translation

Data: {xi = sourcei, yi = targeti}, i = 1, ..., N

Model (Option 1):
Encoder: ht = RNN(ht−1, xit)
Decoder: s0 = hT (T is the length of the input)
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — i/p: "I am going home", o/p: "Mein ghar ja raha hoon"
Task: Machine translation — Model (Option 2)

Same as Option 1, except that the decoder is fed the encoder's final state at every time step:

st = RNN(st−1, [hT, e(ŷt−1)])

The data, the remaining equations, the parameters, the loss and the training algorithm are unchanged.
Task: Transliteration

Data: {xi = srcwordi, yi = tgtwordi}, i = 1, ..., N

Model (Option 1):
Encoder: ht = RNN(ht−1, xit)
Decoder: s0 = hT (T is the length of the input)
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — i/p: the characters "I N D I A", o/p: the corresponding word in the target script, generated character by character
Task: Transliteration — Model (Option 2)

Same as Option 1, except that the decoder is fed the encoder's final state at every time step:

st = RNN(st−1, [e(ŷt−1), hT])

The data, the remaining equations, the parameters, the loss and the training algorithm are unchanged.
Task: Image Question Answering

Data: {xi = {I, q}i, yi = Answeri}, i = 1, ..., N

Model:
Encoder: ĥI = CNN(I),  h̃t = RNN(h̃t−1, qit)
         s = [h̃T ; ĥI]
Decoder: P(y | q, I) = softmax(V s + b)

Parameters: V, b, Uq, Wq, Wconv

Loss:
L(θ) = − log P(y = ℓ | I, q)

Algorithm: Gradient descent with backpropagation

Example — image of a bird, question: "What is the bird's color", answer: "White"
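A sketch of this question-answering model: the final RNN state over the question is concatenated with the CNN image code and passed through a single softmax classifier. The answer-vocabulary size and parameter names are assumptions; the RNN parameters are reused from the earlier sketches.

```python
N_ANS = 1000                                             # assumed answer vocabulary size
V_ans = rng.normal(0, 0.01, (N_ANS, HID + IMG))
b_ans = np.zeros(N_ANS)

def answer_distribution(image_fc7, question_embs, h0):
    """P(y | q, I) = softmax(V [h_T ; h_I] + b)."""
    h = h0
    for q_t in question_embs:                            # h_t = RNN(h_{t-1}, q_t)
        h = np.tanh(W @ h + U @ q_t + b)
    s = np.concatenate([h, image_fc7])                   # s = [h_T ; h_I]
    return softmax(V_ans @ s + b_ans)
```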
Task: Document summarization

Data: {xi = Documenti, yi = Summaryi}, i = 1, ..., N

Model:
Encoder: ht = RNN(ht−1, xit)
Decoder: s0 = hT
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — i/p: "India beats Srilanka to win ICC WC 2011. Dhoni and Gambhir's half centuries help beat SL", o/p: "India won the world cup <stop>"
Task: Video captioning

Data: {xi = videoi, yi = desci}, i = 1, ..., N

Model:
Encoder: ht = RNN(ht−1, CNN(xit))
Decoder: s0 = hT
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, Wdec, V, b, Wconv, Uenc, Wenc

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — o/p: "A man walking on a rope"
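The only new piece relative to the text models is the encoder, sketched below: a CNN is applied to every frame and its output is fed to an RNN. `frame_cnn` is an assumed stand-in that returns a fixed-size frame feature; the RNN parameters are reused from the earlier sketches.

```python
def encode_video(frames, frame_cnn, h0):
    """h_t = RNN(h_{t-1}, CNN(x_t)); the final state h_T initializes the decoder."""
    h = h0
    for frame in frames:
        f = frame_cnn(frame)                # assumed to return an EMB-sized feature vector
        h = np.tanh(W @ h + U @ f + b)
    return h
```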
Task: Video classification

Data: {xi = Videoi, yi = Activityi}, i = 1, ..., N

Model:
Encoder: ht = RNN(ht−1, CNN(xit))
Decoder: s = hT
         P(y | video) = softmax(V s + b)

Parameters: V, b, Wconv, Uenc, Wenc

Loss:
L(θ) = − log P(y = ℓ | video)

Algorithm: Gradient descent with backpropagation

Example — o/p: "Surya Namaskar"
Task: Dialog

Data: {xi = Utterancei, yi = Responsei}, i = 1, ..., N

Model:
Encoder: ht = RNN(ht−1, xit)
Decoder: s0 = hT (T is the length of the input)
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — i/p: "How are you", o/p: "I am fine <stop>"
And the list continues ...

Try picking a problem from your domain and see if you can model it using the encoder-decoder paradigm.

Encoder-decoder models can be made even more expressive by adding an "attention" mechanism.

We will first motivate the need for this and then explain how to model it.
Module 16.3: Attention Mechanism

Let us motivate the task of attention with the help of machine translation.

The encoder reads the sentence only once and encodes it.

At each timestep the decoder uses this embedding to produce a new word.

Is this how humans translate a sentence? Not really!

[Figure: the vanilla encoder-decoder translating "Main ghar ja raha hoon" into "I am going home <stop>".]
Humans try to produce each word in the output by focusing only on certain words in the input.

For example, while translating "Main ghar ja raha hoon" into "I am going home", one might focus on the following input words at each output step:

t1: [1 0 0 0 0]
t2: [0 0 0 0 1]
t3: [0 0 0.5 0.5 0]
t4: [0 1 0 0 0]

Essentially, at each time step we come up with a distribution over the input words.

This distribution tells us how much attention to pay to each input word at each time step.

Ideally, at each time step we should feed only this relevant information (i.e., encodings of the relevant words) to the decoder.
Let us revisit the decoder that we have seen so far.

We either feed in the encoder information only once (at s0), or we feed the same encoder information at each time step.

Now suppose an oracle told you which words to focus on at a given time step t.

Can you think of a smarter way of feeding information to the decoder?

[Figure: the encoder-decoder for "Main ghar ja raha hoon" → "I am going home <stop>", with the encoder summary fed to the decoder.]
We could just take a weighted average of the corresponding word representations and feed it to the decoder.

For example, at timestep 3 we could take a weighted average of the representations of 'ja' and 'raha'.

Intuitively this should work better because we are not overloading the decoder with irrelevant information (about words that do not matter at this time step).

How do we convert this intuition into a model?

[Figure: at each decoder step t, a weighted combination ct of the encoder states hj is fed to the decoder.]
Of course, in practice we will not have this oracle.

The machine will have to learn this from the data.

To enable this we define a function

ejt = fATT(st−1, cj)

This quantity captures the importance of the j-th input word for decoding the t-th output word (we will see the exact form of fATT later).

We can normalize these weights by using the softmax function:

αjt = exp(ejt) / Σ_{j=1}^{M} exp(ejt)

where M is the number of input words.
αjt = exp(ejt) / Σ_{j=1}^{M} exp(ejt)

αjt denotes the probability of focusing on the j-th word to produce the t-th output word.

We are now trying to learn the α's instead of an oracle informing us about the α's.

Learning would always involve some parameters.

So let's define a parametric form for the α's.
From now on we will refer to the decoder RNN's state at the t-th timestep as st and the encoder RNN's state at the j-th timestep as cj.

Given these new notations, one (among many) possible choice for fATT is

ejt = Vatt^T tanh(Uatt st−1 + Watt cj)

where Vatt ∈ R^d, Uatt ∈ R^{d×d}, Watt ∈ R^{d×d} are additional parameters of the model.

These parameters will be learned along with the other parameters of the encoder and decoder.
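A NumPy sketch of this scoring function and its normalization, with d taken equal to the state size used in the earlier sketches; the parameter initialization is an assumption.

```python
d = HID
U_att = rng.normal(0, 0.01, (d, d))
W_att = rng.normal(0, 0.01, (d, d))
V_att = rng.normal(0, 0.01, d)

def attention_weights(s_prev, enc_states):
    """e_jt = V_att^T tanh(U_att s_{t-1} + W_att c_j); alpha_jt = softmax over j."""
    e = np.array([V_att @ np.tanh(U_att @ s_prev + W_att @ c_j) for c_j in enc_states])
    e = e - e.max()                          # numerically stable softmax
    a = np.exp(e)
    return a / a.sum()
```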
Wait a minute!

This model would make a lot of sense if we were given the true α's at training time, for example:

αt^true = [0, 0, 0.5, 0.5, 0]
αt^pred = [0.1, 0.1, 0.35, 0.35, 0.1]

We could then minimize L(αtrue, αpred) in addition to L(θ) as defined earlier.

But in practice it is very hard to get αtrue.
For example, in our translation example we would want someone to manually annotate the source words which contribute to every target word.

It is hard to get such annotated data.

Then how would this model work in the absence of such data?
It works because it is a better modeling choice.

This is a more informed model.

We are essentially asking the model to approach the problem in a better (more natural) way.

Given enough data, it should be able to learn these attention weights just as humans do.

That's the hope (and hope is a good thing).

And in practice these models do indeed work better than the vanilla encoder-decoder models.
Let us revisit the MT model that we saw earlier and answer the same set of questions again (data, encoder, decoder, loss, training algorithm).
Task: Machine Translation (with attention)

Data: {xi = sourcei, yi = targeti}, i = 1, ..., N

Encoder:
ht = RNN(ht−1, xt)
s0 = hT

Decoder:
ejt = Vattn^T tanh(Uattn hj + Wattn st−1)
αjt = softmax(ejt)
ct = Σ_{j=1}^{T} αjt hj
st = RNN(st−1, [e(ŷt−1), ct])
P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b, Uattn, Vattn, Wattn

The loss and the training algorithm remain the same.

[Figure: at each decoder step, the context vector ct (a weighted average of the encoder states of "Main ghar ja raha hoon") is fed to the decoder generating "I am going home <stop>".]
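One full decoder step matching these equations, reusing `attention_weights` and the parameters from the earlier sketches; the matrix mixing the concatenated [e(ŷt−1), ct] input is an assumed name.

```python
U_dec = rng.normal(0, 0.01, (HID, EMB + HID))

def attn_decoder_step(s_prev, y_prev_emb, enc_states):
    alpha = attention_weights(s_prev, enc_states)         # alpha_jt, j = 1..T
    c_t = sum(a * h for a, h in zip(alpha, enc_states))   # c_t = sum_j alpha_jt h_j
    s_t = np.tanh(W @ s_prev + U_dec @ np.concatenate([y_prev_emb, c_t]) + b)
    p_t = softmax(V @ s_t + c)                            # distribution over target words
    return s_t, p_t, alpha
```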
You can try adding an attention component to all the other encoder-decoder models that we discussed earlier and answer the same set of questions (data, encoder, decoder, loss, training algorithm).
Can we check if the attention model actually learns something meaningful?

In other words, does it really learn to focus on the most relevant words in the input at the t-th timestep?

We can check this by plotting the attention weights as a heatmap (we will see some examples on the next slide).
Figure: Example output of an attention-based summarization system [Rush et al. 2015].

Figure: Example output of an attention-based neural machine translation model [Cho et al. 2015].

The heat map shows a soft alignment between the input and the generated output.

Each cell in the heat map corresponds to αjt (i.e., the importance of the j-th input word for predicting the t-th output word, as determined by the model).
Figure: Example output of an attention-based video captioning system [Yao et al. 2015].
Module 16.4: Attention over images

How do we model an attention mechanism for images?

[Figure: an image captioned "A man throwing a frisbee in a park".]
In the case of text we have a representation for every location (time step) of the input sequence.

But for images we typically use a representation from one of the fully connected layers.

This representation does not contain any location information.

So then what is the input to the attention mechanism?

[Figure: a CNN encoder producing a single vector h0 from the image.]
Well, instead of the fc7 representation we can use the output of one of the convolutional layers, which has spatial information.

For example, the output of the 5th convolutional layer of VGGNet is a 14 × 14 × 512 feature map.

[Figure: the VGGNet architecture, from the 224 × 224 input through the convolutional and max-pooling blocks down to a 14 × 14 × 512 feature map, followed by the 4096-dimensional fully connected layers and the 1000-way softmax.]
We could think of this 14 × 14 × 512 feature map as 196 locations, each having a 512-dimensional representation.

The model will then learn an attention over these locations (which in turn correspond to actual locations in the image).

[Figure: attention weights αt1, ..., αt196 over the 196 spatial locations of the feature map.]
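A sketch of attention over these 196 locations: the 14 × 14 × 512 feature map is reshaped into 196 vectors of size 512 and scored with the same additive form as before; the projection matrices here are assumptions sized for 512-dimensional location vectors.

```python
L_LOC, D = 196, 512
U_loc = rng.normal(0, 0.01, (D, HID))
W_loc = rng.normal(0, 0.01, (D, D))
V_loc = rng.normal(0, 0.01, D)

def image_context(s_prev, conv_map):
    """conv_map: array of shape (14, 14, 512) -> 512-d context vector for the decoder."""
    locs = conv_map.reshape(L_LOC, D)                     # 196 locations, 512-d each
    e = np.array([V_loc @ np.tanh(U_loc @ s_prev + W_loc @ a) for a in locs])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                  # alpha_t1 .. alpha_t196
    return alpha @ locs                                   # weighted average over locations
```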
Let us look at some examples of attention over images for the task of image captioning.
Figure: Examples of the attention-based model attending to the correct object (white indicates the attended regions, underlining indicates the corresponding word) [Kyunghyun Cho et al. 2015].
Module 16.5: Hierarchical Attention

Consider a dialog between a user (U) and a bot (B):

Context
U: Can you suggest a good movie?
B: Yes, sure. How about Logan?
U: Okay, who is the lead actor?

Response
B: Hugh Jackman, of course

The dialog contains a sequence of utterances between the user and the bot.

Each utterance in turn is a sequence of words.

Thus what we have here is a "sequence of sequences" as input.

Can you think of an encoder for such a sequence of sequences?
We could think of a two-level hierarchical RNN encoder.

The first-level RNN operates on the sequence of words in each utterance and gives us a representation of that utterance.

We now have a sequence of utterance representations.

We can now have another RNN which encodes this sequence and gives a single representation for the sequence of utterances.

The decoder can then produce an output sequence conditioned on this representation.

[Figure: a word-level RNN over each utterance ("Can you ... movie?", "Yes sure ... Logan?", "Okay who ... actor?"), an utterance-level RNN over the resulting vectors, and a decoder generating "Hugh Jackman of course <stop>".]
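A minimal sketch of such a two-level encoder, reusing the earlier RNN parameters for the word level and an assumed second parameter set for the utterance level; sharing one set of word-level weights across utterances is a simplifying assumption.

```python
U_utt = rng.normal(0, 0.01, (HID, HID))      # assumed input-to-state matrix for level 2

def encode_words(word_embs, h0):
    """Level 1: RNN over the words of one utterance -> utterance vector."""
    h = h0
    for w in word_embs:
        h = np.tanh(W @ h + U @ w + b)
    return h

def encode_dialog(utterances, h0_word, h0_utt):
    """Level 2: RNN over the utterance vectors -> single context vector for the decoder."""
    h = h0_utt
    for utt in utterances:                   # utterances: list of lists of word embeddings
        s_i = encode_words(utt, h0_word)
        h = np.tanh(W @ h + U_utt @ s_i + b)
    return h
```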
Let us look at another example. Consider the task of document classification or summarization.

A document is a sequence of sentences.

Each sentence in turn is a sequence of words.

We can again use a hierarchical RNN to model this.

[Figure: a word-level RNN over each sentence of a document ("Politics is the process of making decisions applying to all members of each group. More narrowly, it refers to achieving and ..."), a sentence-level RNN over the sentence vectors, and an output label "Politics".]
Data: {Documenti, classi}, i = 1, ..., N

Word-level (level 1) encoder:
h1_ij = RNN(h1_{i,j−1}, wij)
si = h1_{i,Ti}   (Ti is the length of sentence i)

Sentence-level (level 2) encoder:
h2_i = RNN(h2_{i−1}, si)
s = h2_K   (K is the number of sentences)

Decoder:
P(y | document) = softmax(V s + b)

Parameters: W1enc, U1enc, W2enc, U2enc, V, b

Loss: Cross entropy

Algorithm: Gradient descent with backpropagation
How would you model attention in such a hierarchical encoder-decoder model?

We need attention at two levels.

First we need to attend to the important (most informative) words in a sentence.

Then we need to attend to the important (most informative) sentences in a document.

Let us see how to model this.

Figure: Hierarchical Attention Network [Yang et al.]
Data: {Documenti, classi}, i = 1, ..., N

Word-level (level 1) encoder:
hij = RNN(h_{i,j−1}, wij)
uij = tanh(Ww hij + bw)
αij = exp(uij^T uw) / Σ_t exp(uit^T uw)
si = Σ_j αij hij

Sentence-level (level 2) encoder:
hi = RNN(hi−1, si)
ui = tanh(Ws hi + bs)
αi = exp(ui^T us) / Σ_i exp(ui^T us)
s = Σ_i αi hi

Figure: Hierarchical Attention Network [Yang et al.]
Decoder:
P(y | document) = softmax(V s + b)

Parameters: Ww, Ws, V, bw, bs, b, uw, us

Loss: Cross entropy

Algorithm: Gradient descent with backpropagation

Figure: Hierarchical Attention Network [Yang et al.]
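A sketch of the word-level attention pooling defined in the equations above (the sentence level is identical with Ws, bs, us in place of Ww, bw, uw); the initialization and sizes are assumptions carried over from the earlier sketches.

```python
W_w = rng.normal(0, 0.01, (HID, HID))
b_w = np.zeros(HID)
u_w = rng.normal(0, 0.01, HID)               # learned word-level context vector

def attend_and_pool(word_states):
    """h_i1..h_iT for one sentence -> sentence vector s_i = sum_j alpha_ij h_ij."""
    u = [np.tanh(W_w @ h + b_w) for h in word_states]     # u_ij
    e = np.array([u_ij @ u_w for u_ij in u])              # u_ij^T u_w
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                  # alpha_ij
    return sum(a * h for a, h in zip(alpha, word_states))
```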
