Encoder-Decoder Models Overview

The document discusses Encoder-Decoder models and the Attention Mechanism in deep learning, focusing on language modeling and image captioning. It explains how to model conditional probabilities using Recurrent Neural Networks (RNNs) and introduces the architecture for generating sentences from images. Additionally, it outlines applications of these models, including tasks like image captioning and textual entailment, while emphasizing the importance of encoder and decoder networks.

CS7015 (Deep Learning) : Lecture 16

Encoder Decoder Models, Attention Mechanism

Mitesh M. Khapra

Department of Computer Science and Engineering


Indian Institute of Technology Madras

Module 16.1: Introduction to Encoder Decoder Models

We will start by revisiting the problem of language modeling.

[Figure: an RNN unrolled over the input "<GO> I am at home today", producing the output "I am at home today <stop>"; at each time step the state st is computed from st−1 and the input xt, and the output distribution is computed from st.]

Informally, given 't − 1' words we are interested in predicting the t-th word.

More formally, given y1, y2, ..., yt−1 we want to find

y* = argmax P(yt | y1, y2, ..., yt−1)

Let us see how we model P(yt | y1, y2, ..., yt−1) using an RNN.

We will refer to P(yt | y1, y2, ..., yt−1) by the shorthand notation P(yt | y1^{t−1}).
We are interested in P(yt = j | y1, y2, ..., yt−1), where j ∈ V and V is the set of all vocabulary words.

Using an RNN we compute this as

P(yt = j | y1^{t−1}) = softmax(V st + c)_j

In other words we compute

P(yt = j | y1^{t−1}) = P(yt = j | st) = softmax(V st + c)_j

Notice that the recurrent connections ensure that st has information about y1^{t−1}.
Data: all sentences from any large corpus (say Wikipedia), e.g., "India, officially the Republic of India, is a country in South Asia. It is the seventh-largest country by area, ..."

Model:

st = σ(W st−1 + U xt + b)
P(yt = j | y1^{t−1}) = softmax(V st + c)_j

Parameters: U, V, W, b, c

Loss:

L(θ) = Σ_{t=1}^{T} Lt(θ),   Lt(θ) = − log P(yt = ℓt | y1^{t−1})

where ℓt is the true word at time step t.
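As a concrete illustration, here is a minimal NumPy sketch of this language model: one recurrent step and the summed cross-entropy loss. The sizes, the random initialization, and the use of tanh for σ are assumptions made for the sketch, not the lecture's reference implementation; the later sketches in this document reuse these names.

```python
import numpy as np

V_SIZE, EMB, HID = 10000, 64, 128           # assumed vocabulary, input and state sizes
rng = np.random.default_rng(0)
U = rng.normal(0, 0.01, (HID, EMB))         # input-to-state
W = rng.normal(0, 0.01, (HID, HID))         # state-to-state
b = np.zeros(HID)
V = rng.normal(0, 0.01, (V_SIZE, HID))      # state-to-vocabulary
c = np.zeros(V_SIZE)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def step(s_prev, x_t):
    """s_t = sigma(W s_{t-1} + U x_t + b); sigma taken as tanh here."""
    s_t = np.tanh(W @ s_prev + U @ x_t + b)
    p_t = softmax(V @ s_t + c)              # P(y_t = j | y_1^{t-1})
    return s_t, p_t

def sequence_loss(xs, targets, s0):
    """L(theta) = sum_t -log P(y_t = l_t | y_1^{t-1}), l_t = true word at step t."""
    s, loss = s0, 0.0
    for x_t, l_t in zip(xs, targets):
        s, p = step(s, x_t)
        loss += -np.log(p[l_t] + 1e-12)
    return loss
```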
What is the input at each time step?

It is simply the word that we predicted at the previous time step.

In general, st = RNN(st−1, xt).

Let j be the index of the word which has been assigned the maximum probability at time step t − 1. Then

xt = e(vj)

xt is essentially a one-hot vector e(vj) representing the j-th word in the vocabulary.

In practice, instead of the one-hot representation we use a pre-trained word embedding of the j-th word.

[Figure: the unrolled RNN with one-hot input vectors; the first input is <GO>, and each subsequent input is the previously predicted word.]
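A hedged sketch of generation with this scheme, reusing `step` and the parameters from the sketch above: at each step the embedding of the previously predicted word is fed back in. The embedding matrix `E` and the special token ids are assumptions.

```python
E = rng.normal(0, 0.01, (V_SIZE, EMB))      # e(v_j) lookup table (could be pre-trained)
GO, STOP = 1, 2                             # assumed ids of the <GO> and <stop> tokens

def generate(s0, max_len=20):
    """Greedy decoding: x_t = e(v_j) where j got the max probability at step t-1."""
    s, j, out = s0, GO, []
    for _ in range(max_len):
        s, p = step(s, E[j])                # feed back the previous prediction
        j = int(np.argmax(p))               # most probable word at this step
        if j == STOP:
            break
        out.append(j)
    return out
```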
Notice that s0 is not computed but just randomly initialized.

We learn it along with the other parameters of the RNN (or LSTM or GRU).

We will return to this later.
Before moving on we will see a compact way of writing the function computed by an RNN, GRU and LSTM. We will use these notations going forward.

RNN:   st = σ(U xt + W st−1 + b)
       st = RNN(st−1, xt)

GRU:   s̃t = σ(W(ot ⊙ st−1) + U xt + b)
       st = it ⊙ st−1 + (1 − it) ⊙ s̃t
       st = GRU(st−1, xt)

LSTM:  s̃t = σ(W ht−1 + U xt + b)
       st = ft ⊙ st−1 + it ⊙ s̃t
       ht = ot ⊙ σ(st)
       ht, st = LSTM(ht−1, st−1, xt)
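The compact notation maps directly onto code: each cell is a function from the previous state(s) and the current input to the new state(s). Below is a sketch for the RNN and GRU cells under the same assumptions as before (the lecture's σ is taken as sigmoid for the gates and tanh for the candidate state); an LSTM cell would follow the same pattern but maintain and return both ht and st.

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# assumed gate parameters for the GRU sketch (one gate o_t, one interpolation gate i_t)
Wo, Uo, bo = rng.normal(0, 0.01, (HID, HID)), rng.normal(0, 0.01, (HID, EMB)), np.zeros(HID)
Wi, Ui, bi = rng.normal(0, 0.01, (HID, HID)), rng.normal(0, 0.01, (HID, EMB)), np.zeros(HID)

def RNN(s_prev, x_t):
    """s_t = RNN(s_{t-1}, x_t)"""
    return np.tanh(W @ s_prev + U @ x_t + b)

def GRU(s_prev, x_t):
    """s_t = GRU(s_{t-1}, x_t)"""
    o_t = sigmoid(Wo @ s_prev + Uo @ x_t + bo)           # gate applied to the old state
    i_t = sigmoid(Wi @ s_prev + Ui @ x_t + bi)           # interpolation gate
    s_tilde = np.tanh(W @ (o_t * s_prev) + U @ x_t + b)  # candidate state
    return i_t * s_prev + (1 - i_t) * s_tilde
```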
So far we have seen how to model the conditional probability distribution P(yt | y1^{t−1}).

More informally, we have seen how to generate a sentence given the previous words.

What if we want to generate a sentence given an image?

[Figure: an image of a man throwing a frisbee in a park, with an RNN generating the caption "A man throwing a frisbee in a park".]

We are now interested in P(yt | y1^{t−1}, I) instead of P(yt | y1^{t−1}), where I is an image.

Notice that P(yt | y1^{t−1}, I) is again a conditional distribution.
Earlier we modeled P(yt | y1^{t−1}) as

P(yt = j | y1^{t−1}) = P(yt = j | st)

where st was a state capturing all the previous words.

We could now model P(yt = j | y1^{t−1}, I) as P(yt = j | st, fc7(I)), where fc7(I) is the representation obtained from the fc7 layer of a CNN applied to the image.
There are many ways of making P(yt = j) conditional on fc7(I).

Let us see two such options.
Option 1: Set s0 = fc7(I).

Now s0, and hence all subsequent st's, depend on fc7(I).

We can thus say that P(yt = j) depends on fc7(I).

In other words, we are computing P(yt = j | st, fc7(I)).

[Figure: Option 1 — the CNN encoding of the image is used only to initialize the decoder state s0.]
Option 2: Another, more explicit, way of doing this is to compute

st = RNN(st−1, [xt, fc7(I)])

In other words, we are explicitly using fc7(I) to compute st and hence P(yt = j).

You could think of other ways of conditioning P(yt = j) on fc7(I).

[Figure: Option 2 — the CNN encoding of the image is concatenated with the input at every decoder time step.]
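A sketch of the two options, reusing the decoder parameters from the earlier sketches. The projection of fc7(I) to the state size and the assumed 4096-dimensional fc7 vector are illustrative choices, not part of the lecture.

```python
IMG = 4096                                    # assumed fc7 size
W_img = rng.normal(0, 0.01, (HID, IMG))       # assumed projection of fc7(I) to the state size
U_cat = rng.normal(0, 0.01, (HID, EMB + HID)) # input-to-state when concatenating [x_t, fc7(I)]

def decode_option1(fc7, xs):
    """Option 1: the image enters only through s_0."""
    s, states = np.tanh(W_img @ fc7), []
    for x_t in xs:
        s, _ = step(s, x_t)
        states.append(s)
    return states

def decode_option2(fc7, xs, s0):
    """Option 2: s_t = RNN(s_{t-1}, [x_t, fc7(I)]); the image is fed at every step."""
    img = np.tanh(W_img @ fc7)
    s, states = s0, []
    for x_t in xs:
        s = np.tanh(W @ s + U_cat @ np.concatenate([x_t, img]) + b)
        states.append(s)
    return states
```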
Let us look at the full architecture.

A CNN is first used to encode the image.

An RNN is then used to decode (generate) a sentence from the encoding.

This is a typical encoder-decoder architecture.

Both the encoder and the decoder use a neural network.

[Figure: the CNN encoder produces h0 from the image; the RNN decoder generates "A man throwing ... <stop>".]
Alternatively, the encoder's output can be fed to every step of the decoder.
Module 16.2: Applications of Encoder Decoder models

For all these applications we will try to answer the following questions:

What kind of a network can we use to encode the input(s)? (What is an appropriate encoder?)

What kind of a network can we use to decode the output? (What is an appropriate decoder?)

What are the parameters of the model?

What is an appropriate loss function?
Task: Image captioning

Data: {xi = imagei, yi = captioni}, i = 1, ..., N

Model:
Encoder: s0 = CNN(xi)
Decoder: st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, I) = softmax(V st + b)

Parameters: Udec, V, Wdec, Wconv, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, I)

Algorithm: Gradient descent with backpropagation

[Figure: the CNN encodes the image; the RNN decoder generates the caption "A man throwing ... <stop>".]
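Putting the pieces together for captioning, here is a sketch of the per-example loss with teacher forcing (the true previous word is fed in during training); it reuses the names from the earlier sketches and is only meant to show how the equations combine.

```python
def caption_loss(fc7, caption_ids):
    """-sum_t log P(y_t = l_t | y_1^{t-1}, I) with s_0 derived from the image."""
    s, loss, prev = np.tanh(W_img @ fc7), 0.0, GO        # s_0 = CNN(x_i) (projected)
    for l_t in caption_ids:                              # l_t: true caption word at step t
        s, p = step(s, E[prev])                          # s_t = RNN(s_{t-1}, e(y_{t-1}))
        loss += -np.log(p[l_t] + 1e-12)
        prev = l_t                                       # teacher forcing
    return loss
```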
Task: Textual entailment

Data: {xi = premisei, yi = hypothesisi}, i = 1, ..., N

Model (Option 1):
Encoder: ht = RNN(ht−1, xit)
Decoder: s0 = hT (T is the length of the input)
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — i/p: "It is raining outside", o/p: "The ground is wet <stop>"
Task: Textual entailment — Model (Option 2)

Same as Option 1, except that the decoder is fed the encoder's final state at every time step:

st = RNN(st−1, [hT, e(ŷt−1)])

The data, the remaining equations, the parameters, the loss and the training algorithm are unchanged.
Task: Machine translation

Data: {xi = sourcei, yi = targeti}, i = 1, ..., N

Model (Option 1):
Encoder: ht = RNN(ht−1, xit)
Decoder: s0 = hT (T is the length of the input)
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — i/p: "I am going home", o/p: "Mein ghar ja raha hoon"
Task: Machine translation — Model (Option 2)

Same as Option 1, except that the decoder is fed the encoder's final state at every time step:

st = RNN(st−1, [hT, e(ŷt−1)])

The data, the remaining equations, the parameters, the loss and the training algorithm are unchanged.
Task: Transliteration

Data: {xi = srcwordi, yi = tgtwordi}, i = 1, ..., N

Model (Option 1):
Encoder: ht = RNN(ht−1, xit)
Decoder: s0 = hT (T is the length of the input)
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — i/p: the characters "I N D I A", o/p: the corresponding word in the target script, generated character by character
Task: Transliteration — Model (Option 2)

Same as Option 1, except that the decoder is fed the encoder's final state at every time step:

st = RNN(st−1, [e(ŷt−1), hT])

The data, the remaining equations, the parameters, the loss and the training algorithm are unchanged.
Task: Image Question Answering

Data: {xi = {I, q}i, yi = Answeri}, i = 1, ..., N

Model:
Encoder: ĥI = CNN(I),  h̃t = RNN(h̃t−1, qit)
         s = [h̃T ; ĥI]
Decoder: P(y | q, I) = softmax(V s + b)

Parameters: V, b, Uq, Wq, Wconv

Loss:
L(θ) = − log P(y = ℓ | I, q)

Algorithm: Gradient descent with backpropagation

Example — image of a bird, question: "What is the bird's color", answer: "White"
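A sketch of this question-answering model: the final RNN state over the question is concatenated with the CNN image code and passed through a single softmax classifier. The answer-vocabulary size and parameter names are assumptions; the RNN parameters are reused from the earlier sketches.

```python
N_ANS = 1000                                             # assumed answer vocabulary size
V_ans = rng.normal(0, 0.01, (N_ANS, HID + IMG))
b_ans = np.zeros(N_ANS)

def answer_distribution(image_fc7, question_embs, h0):
    """P(y | q, I) = softmax(V [h_T ; h_I] + b)."""
    h = h0
    for q_t in question_embs:                            # h_t = RNN(h_{t-1}, q_t)
        h = np.tanh(W @ h + U @ q_t + b)
    s = np.concatenate([h, image_fc7])                   # s = [h_T ; h_I]
    return softmax(V_ans @ s + b_ans)
```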
Task: Document summarization

Data: {xi = Documenti, yi = Summaryi}, i = 1, ..., N

Model:
Encoder: ht = RNN(ht−1, xit)
Decoder: s0 = hT
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — i/p: "India beats Srilanka to win ICC WC 2011. Dhoni and Gambhir's half centuries help beat SL", o/p: "India won the world cup <stop>"
Task: Video captioning

Data: {xi = videoi, yi = desci}, i = 1, ..., N

Model:
Encoder: ht = RNN(ht−1, CNN(xit))
Decoder: s0 = hT
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, Wdec, V, b, Wconv, Uenc, Wenc

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — o/p: "A man walking on a rope"
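The only new piece relative to the text models is the encoder, sketched below: a CNN is applied to every frame and its output is fed to an RNN. `frame_cnn` is an assumed stand-in that returns a fixed-size frame feature; the RNN parameters are reused from the earlier sketches.

```python
def encode_video(frames, frame_cnn, h0):
    """h_t = RNN(h_{t-1}, CNN(x_t)); the final state h_T initializes the decoder."""
    h = h0
    for frame in frames:
        f = frame_cnn(frame)                # assumed to return an EMB-sized feature vector
        h = np.tanh(W @ h + U @ f + b)
    return h
```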
Task: Video classification

Data: {xi = Videoi, yi = Activityi}, i = 1, ..., N

Model:
Encoder: ht = RNN(ht−1, CNN(xit))
Decoder: s = hT
         P(y | video) = softmax(V s + b)

Parameters: V, b, Wconv, Uenc, Wenc

Loss:
L(θ) = − log P(y = ℓ | video)

Algorithm: Gradient descent with backpropagation

Example — o/p: "Surya Namaskar"
Task: Dialog

Data: {xi = Utterancei, yi = Responsei}, i = 1, ..., N

Model:
Encoder: ht = RNN(ht−1, xit)
Decoder: s0 = hT (T is the length of the input)
         st = RNN(st−1, e(ŷt−1))
         P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b

Loss:
L(θ) = Σ_{t=1}^{T} Lt(θ) = − Σ_{t=1}^{T} log P(yt = ℓt | y1^{t−1}, x)

Algorithm: Gradient descent with backpropagation

Example — i/p: "How are you", o/p: "I am fine <stop>"
And the list continues ...

Try picking a problem from your domain and see if you can model it using the encoder-decoder paradigm.

Encoder-decoder models can be made even more expressive by adding an "attention" mechanism.

We will first motivate the need for this and then explain how to model it.
Module 16.3: Attention Mechanism

Let us motivate the task of attention with the help of machine translation.

The encoder reads the sentence only once and encodes it.

At each timestep the decoder uses this embedding to produce a new word.

Is this how humans translate a sentence? Not really!

[Figure: the vanilla encoder-decoder translating "Main ghar ja raha hoon" into "I am going home <stop>".]
Humans try to produce each word in the output by focusing only on certain words in the input.

For example, while translating "Main ghar ja raha hoon" into "I am going home", one might focus on the following input words at each output step:

t1: [1 0 0 0 0]
t2: [0 0 0 0 1]
t3: [0 0 0.5 0.5 0]
t4: [0 1 0 0 0]

Essentially, at each time step we come up with a distribution over the input words.

This distribution tells us how much attention to pay to each input word at each time step.

Ideally, at each time step we should feed only this relevant information (i.e., encodings of the relevant words) to the decoder.
Let us revisit the decoder that we have seen so far.

We either feed in the encoder information only once (at s0), or we feed the same encoder information at each time step.

Now suppose an oracle told you which words to focus on at a given time step t.

Can you think of a smarter way of feeding information to the decoder?

[Figure: the encoder-decoder for "Main ghar ja raha hoon" → "I am going home <stop>", with the encoder summary fed to the decoder.]
We could just take a weighted average of the corresponding word representations and feed it to the decoder.

For example, at timestep 3 we could take a weighted average of the representations of 'ja' and 'raha'.

Intuitively this should work better because we are not overloading the decoder with irrelevant information (about words that do not matter at this time step).

How do we convert this intuition into a model?

[Figure: at each decoder step t, a weighted combination ct of the encoder states hj is fed to the decoder.]
Of course, in practice we will not have this oracle.

The machine will have to learn this from the data.

To enable this we define a function

ejt = fATT(st−1, cj)

This quantity captures the importance of the j-th input word for decoding the t-th output word (we will see the exact form of fATT later).

We can normalize these weights by using the softmax function:

αjt = exp(ejt) / Σ_{j=1}^{M} exp(ejt)

where M is the number of input words.
αjt = exp(ejt) / Σ_{j=1}^{M} exp(ejt)

αjt denotes the probability of focusing on the j-th word to produce the t-th output word.

We are now trying to learn the α's instead of an oracle informing us about the α's.

Learning would always involve some parameters.

So let's define a parametric form for the α's.
From now on we will refer to the decoder RNN's state at the t-th timestep as st and the encoder RNN's state at the j-th timestep as cj.

Given these new notations, one (among many) possible choice for fATT is

ejt = Vatt^T tanh(Uatt st−1 + Watt cj)

where Vatt ∈ R^d, Uatt ∈ R^{d×d}, Watt ∈ R^{d×d} are additional parameters of the model.

These parameters will be learned along with the other parameters of the encoder and decoder.
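A NumPy sketch of this scoring function and its normalization, with d taken equal to the state size used in the earlier sketches; the parameter initialization is an assumption.

```python
d = HID
U_att = rng.normal(0, 0.01, (d, d))
W_att = rng.normal(0, 0.01, (d, d))
V_att = rng.normal(0, 0.01, d)

def attention_weights(s_prev, enc_states):
    """e_jt = V_att^T tanh(U_att s_{t-1} + W_att c_j); alpha_jt = softmax over j."""
    e = np.array([V_att @ np.tanh(U_att @ s_prev + W_att @ c_j) for c_j in enc_states])
    e = e - e.max()                          # numerically stable softmax
    a = np.exp(e)
    return a / a.sum()
```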
Wait a minute!

This model would make a lot of sense if we were given the true α's at training time, for example:

αt^true = [0, 0, 0.5, 0.5, 0]
αt^pred = [0.1, 0.1, 0.35, 0.35, 0.1]

We could then minimize L(αtrue, αpred) in addition to L(θ) as defined earlier.

But in practice it is very hard to get αtrue.
For example, in our translation example we would want someone to manually annotate the source words which contribute to every target word.

It is hard to get such annotated data.

Then how would this model work in the absence of such data?
It works because it is a better modeling choice.

This is a more informed model.

We are essentially asking the model to approach the problem in a better (more natural) way.

Given enough data, it should be able to learn these attention weights just as humans do.

That's the hope (and hope is a good thing).

And in practice these models do indeed work better than the vanilla encoder-decoder models.
Let us revisit the MT model that we saw earlier and answer the same set of questions again (data, encoder, decoder, loss, training algorithm).
Task: Machine Translation (with attention)

Data: {xi = sourcei, yi = targeti}, i = 1, ..., N

Encoder:
ht = RNN(ht−1, xt)
s0 = hT

Decoder:
ejt = Vattn^T tanh(Uattn hj + Wattn st−1)
αjt = softmax(ejt)
ct = Σ_{j=1}^{T} αjt hj
st = RNN(st−1, [e(ŷt−1), ct])
P(yt | y1^{t−1}, x) = softmax(V st + b)

Parameters: Udec, V, Wdec, Uenc, Wenc, b, Uattn, Vattn, Wattn

The loss and the training algorithm remain the same.

[Figure: at each decoder step, the context vector ct (a weighted average of the encoder states of "Main ghar ja raha hoon") is fed to the decoder generating "I am going home <stop>".]
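One full decoder step matching these equations, reusing `attention_weights` and the parameters from the earlier sketches; the matrix mixing the concatenated [e(ŷt−1), ct] input is an assumed name.

```python
U_dec = rng.normal(0, 0.01, (HID, EMB + HID))

def attn_decoder_step(s_prev, y_prev_emb, enc_states):
    alpha = attention_weights(s_prev, enc_states)         # alpha_jt, j = 1..T
    c_t = sum(a * h for a, h in zip(alpha, enc_states))   # c_t = sum_j alpha_jt h_j
    s_t = np.tanh(W @ s_prev + U_dec @ np.concatenate([y_prev_emb, c_t]) + b)
    p_t = softmax(V @ s_t + c)                            # distribution over target words
    return s_t, p_t, alpha
```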
You can try adding an attention component to all the other encoder-decoder models that we discussed earlier and answer the same set of questions (data, encoder, decoder, loss, training algorithm).
Can we check if the attention model actually learns something meaningful?

In other words, does it really learn to focus on the most relevant words in the input at the t-th timestep?

We can check this by plotting the attention weights as a heatmap (we will see some examples on the next slide).
Figure: Example output of an attention-based summarization system [Rush et al. 2015].

Figure: Example output of an attention-based neural machine translation model [Cho et al. 2015].

The heat map shows a soft alignment between the input and the generated output.

Each cell in the heat map corresponds to αjt (i.e., the importance of the j-th input word for predicting the t-th output word, as determined by the model).
Figure: Example output of an attention-based video captioning system [Yao et al. 2015].
Module 16.4: Attention over images

How do we model an attention mechanism for images?

[Figure: an image captioned "A man throwing a frisbee in a park".]
In the case of text we have a representation for every location (time step) of the input sequence.

But for images we typically use a representation from one of the fully connected layers.

This representation does not contain any location information.

So then what is the input to the attention mechanism?

[Figure: a CNN encoder producing a single vector h0 from the image.]
Well, instead of the fc7 representation we can use the output of one of the convolutional layers, which has spatial information.

For example, the output of the 5th convolutional layer of VGGNet is a 14 × 14 × 512 feature map.

[Figure: the VGGNet architecture, from the 224 × 224 input through the convolutional and max-pooling blocks down to a 14 × 14 × 512 feature map, followed by the 4096-dimensional fully connected layers and the 1000-way softmax.]
We could think of this 14 × 14 × 512 feature map as 196 locations, each having a 512-dimensional representation.

The model will then learn an attention over these locations (which in turn correspond to actual locations in the image).

[Figure: attention weights αt1, ..., αt196 over the 196 spatial locations of the feature map.]
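A sketch of attention over these 196 locations: the 14 × 14 × 512 feature map is reshaped into 196 vectors of size 512 and scored with the same additive form as before; the projection matrices here are assumptions sized for 512-dimensional location vectors.

```python
L_LOC, D = 196, 512
U_loc = rng.normal(0, 0.01, (D, HID))
W_loc = rng.normal(0, 0.01, (D, D))
V_loc = rng.normal(0, 0.01, D)

def image_context(s_prev, conv_map):
    """conv_map: array of shape (14, 14, 512) -> 512-d context vector for the decoder."""
    locs = conv_map.reshape(L_LOC, D)                     # 196 locations, 512-d each
    e = np.array([V_loc @ np.tanh(U_loc @ s_prev + W_loc @ a) for a in locs])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                  # alpha_t1 .. alpha_t196
    return alpha @ locs                                   # weighted average over locations
```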
Let us look at some examples of attention over images for the task of image captioning.
Figure: Examples of the attention-based model attending to the correct object (white indicates the attended regions, underlining indicates the corresponding word) [Kyunghyun Cho et al. 2015].
Module 16.5: Hierarchical Attention

Consider a dialog between a user (U) and a bot (B):

Context
U: Can you suggest a good movie?
B: Yes, sure. How about Logan?
U: Okay, who is the lead actor?

Response
B: Hugh Jackman, of course

The dialog contains a sequence of utterances between the user and the bot.

Each utterance in turn is a sequence of words.

Thus what we have here is a "sequence of sequences" as input.

Can you think of an encoder for such a sequence of sequences?
We could think of a two-level hierarchical RNN encoder.

The first-level RNN operates on the sequence of words in each utterance and gives us a representation of that utterance.

We now have a sequence of utterance representations.

We can now have another RNN which encodes this sequence and gives a single representation for the sequence of utterances.

The decoder can then produce an output sequence conditioned on this representation.

[Figure: a word-level RNN over each utterance ("Can you ... movie?", "Yes sure ... Logan?", "Okay who ... actor?"), an utterance-level RNN over the resulting vectors, and a decoder generating "Hugh Jackman of course <stop>".]
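A minimal sketch of such a two-level encoder, reusing the earlier RNN parameters for the word level and an assumed second parameter set for the utterance level; sharing one set of word-level weights across utterances is a simplifying assumption.

```python
U_utt = rng.normal(0, 0.01, (HID, HID))      # assumed input-to-state matrix for level 2

def encode_words(word_embs, h0):
    """Level 1: RNN over the words of one utterance -> utterance vector."""
    h = h0
    for w in word_embs:
        h = np.tanh(W @ h + U @ w + b)
    return h

def encode_dialog(utterances, h0_word, h0_utt):
    """Level 2: RNN over the utterance vectors -> single context vector for the decoder."""
    h = h0_utt
    for utt in utterances:                   # utterances: list of lists of word embeddings
        s_i = encode_words(utt, h0_word)
        h = np.tanh(W @ h + U_utt @ s_i + b)
    return h
```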
Let us look at another example. Consider the task of document classification or summarization.

A document is a sequence of sentences.

Each sentence in turn is a sequence of words.

We can again use a hierarchical RNN to model this.

[Figure: a word-level RNN over each sentence of a document ("Politics is the process of making decisions applying to all members of each group. More narrowly, it refers to achieving and ..."), a sentence-level RNN over the sentence vectors, and an output label "Politics".]
Data: {Documenti, classi}, i = 1, ..., N

Word-level (level 1) encoder:
h1_ij = RNN(h1_{i,j−1}, wij)
si = h1_{i,Ti}   (Ti is the length of sentence i)

Sentence-level (level 2) encoder:
h2_i = RNN(h2_{i−1}, si)
s = h2_K   (K is the number of sentences)

Decoder:
P(y | document) = softmax(V s + b)

Parameters: W1enc, U1enc, W2enc, U2enc, V, b

Loss: Cross entropy

Algorithm: Gradient descent with backpropagation
How would you model attention in such a hierarchical encoder-decoder model?

We need attention at two levels.

First we need to attend to the important (most informative) words in a sentence.

Then we need to attend to the important (most informative) sentences in a document.

Let us see how to model this.

Figure: Hierarchical Attention Network [Yang et al.]
Data: {Documenti, classi}, i = 1, ..., N

Word-level (level 1) encoder:
hij = RNN(h_{i,j−1}, wij)
uij = tanh(Ww hij + bw)
αij = exp(uij^T uw) / Σ_t exp(uit^T uw)
si = Σ_j αij hij

Sentence-level (level 2) encoder:
hi = RNN(hi−1, si)
ui = tanh(Ws hi + bs)
αi = exp(ui^T us) / Σ_i exp(ui^T us)
s = Σ_i αi hi

Figure: Hierarchical Attention Network [Yang et al.]
Decoder:
P(y | document) = softmax(V s + b)

Parameters: Ww, Ws, V, bw, bs, b, uw, us

Loss: Cross entropy

Algorithm: Gradient descent with backpropagation

Figure: Hierarchical Attention Network [Yang et al.]
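A sketch of the word-level attention pooling defined in the equations above (the sentence level is identical with Ws, bs, us in place of Ww, bw, uw); the initialization and sizes are assumptions carried over from the earlier sketches.

```python
W_w = rng.normal(0, 0.01, (HID, HID))
b_w = np.zeros(HID)
u_w = rng.normal(0, 0.01, HID)               # learned word-level context vector

def attend_and_pool(word_states):
    """h_i1..h_iT for one sentence -> sentence vector s_i = sum_j alpha_ij h_ij."""
    u = [np.tanh(W_w @ h + b_w) for h in word_states]     # u_ij
    e = np.array([u_ij @ u_w for u_ij in u])              # u_ij^T u_w
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                  # alpha_ij
    return sum(a * h for a, h in zip(alpha, word_states))
```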
