Natural Language Processing
Pushpak Bhattacharyya
CSE Dept,
IIT Patna and Bombay
LSTM
Recap
Feedforward Network and
Backpropagation
Backpropagation algorithm
[Figure: a fully connected feedforward network with an input layer of n neurons, hidden layers, and an output layer of m neurons; weight w_ji connects neuron i to neuron j in the next layer.]
• Fully connected feedforward network
• Pure FF network (no jumping of connections over layers)
General Backpropagation Rule
• General weight updating rule:
  Δw_ji = η δ_j o_i
• where
  δ_j = (t_j − o_j) o_j (1 − o_j)                    for the output layer
  δ_j = (Σ_{k ∈ next layer} w_kj δ_k) o_j (1 − o_j)  for hidden layers
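As a quick illustration (not from the slides), this rule for a single-hidden-layer sigmoid network can be written in a few lines of numpy; the names (W1, W2, eta) and shapes are assumptions of the sketch.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training step for a 1-hidden-layer sigmoid network (illustrative sketch).
# x: input vector, t: target vector, W1/W2: weight matrices, eta: learning rate.
def backprop_step(x, t, W1, W2, eta=0.1):
    h = sigmoid(W1 @ x)                           # hidden activations o_i
    o = sigmoid(W2 @ h)                           # output activations o_j
    delta_out = (t - o) * o * (1 - o)             # delta_j for the output layer
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # delta_j for the hidden layer
    W2 += eta * np.outer(delta_out, h)            # Delta w_ji = eta * delta_j * o_i
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2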
Recurrent Neural Network
Sequence processing machine
E.g. POS Tagging
  Purchased/VBD Videocon/NNP machine/NN
E.g. Sentiment Analysis: a decision on a piece of text
[Figure (slides 9-13): the network reads "I like the camera <EOS>" one word at a time; at each step t the state h_t is combined with a context vector c_t, a weighted sum of the vectors o_1..o_4 with attention weights a_t1..a_t4; after <EOS> the network outputs the decision: positive sentiment.]
Notation: input and state
• x_t : input at time step t
• s_t : hidden state at time step t; it is the "memory" of the network
• s_t = f(U·x_t + W·s_{t−1}); the matrices U and W are learnt
• f is usually tanh or ReLU (approximated by softplus)
Tanh, ReLU (rectified linear unit) and Softplus
  tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})
  f(x) = max(0, x)            (ReLU)
  g(x) = ln(1 + e^x)          (softplus)
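For reference, a direct numpy rendering of these three activations (a minimal sketch):

import numpy as np

def tanh(x):
    return np.tanh(x)               # (e^x - e^-x) / (e^x + e^-x)

def relu(x):
    return np.maximum(0.0, x)       # f(x) = max(0, x)

def softplus(x):
    return np.log1p(np.exp(x))      # g(x) = ln(1 + e^x), smooth approximation of ReLU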
Notation: output
• o_t is the output at step t
• For example, if we wanted to predict the next word in a sentence, it would be a vector of probabilities over our vocabulary
• o_t = softmax(V·s_t)
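Putting the two notation slides together, one forward step of the vanilla RNN can be sketched as follows; U, W, V are the learnt matrices, and the names and shapes are illustrative assumptions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V):
    """One time step: s_t = tanh(U x_t + W s_{t-1}),  o_t = softmax(V s_t)."""
    s_t = np.tanh(U @ x_t + W @ s_prev)
    o_t = softmax(V @ s_t)
    return s_t, o_t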
Backpropagation through time
(BPTT algorithm)
• The forward pass computes and stores the activations at each time step.
• The backward pass computes the error derivatives at each time step.
• After the backward pass, we add together the derivatives at all the different time steps for each weight.
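A compact sketch of BPTT for the vanilla RNN step above, accumulating the weight gradients over all time steps; a softmax output with cross-entropy loss is assumed, and all names are illustrative.

import numpy as np

def bptt(xs, ys, U, W, V, s0):
    """xs: list of input vectors, ys: list of target indices.
    Returns gradients dU, dW, dV summed over all time steps."""
    # Forward pass: store the states and output probabilities at each time step.
    states, probs = [s0], []
    for x_t in xs:
        s_t = np.tanh(U @ x_t + W @ states[-1])
        states.append(s_t)
        z = V @ s_t
        e = np.exp(z - z.max())
        probs.append(e / e.sum())
    # Backward pass: compute the error derivatives at each time step and add them up.
    dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    ds_next = np.zeros_like(s0)                   # gradient flowing back through the state
    for t in reversed(range(len(xs))):
        do = probs[t].copy()
        do[ys[t]] -= 1.0                          # d loss / d (V s_t) for softmax + cross-entropy
        dV += np.outer(do, states[t + 1])
        ds = V.T @ do + ds_next                   # from the output and from the future state
        dz = ds * (1.0 - states[t + 1] ** 2)      # back through tanh
        dU += np.outer(dz, xs[t])
        dW += np.outer(dz, states[t])
        ds_next = W.T @ dz                        # pass the gradient to the previous time step
    return dU, dW, dV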
A recurrent net for binary
addition
• Two input units and one output unit.
• The network is given two input digits at each time step.
• The desired output at each time step is the output digit for the column that was provided as input two time steps ago.
  – It takes one time step to update the hidden units based on the two input digits.
  – It takes another time step for the hidden units to cause the output.
• Example: 00110100 + 01001101 = 10000001
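The two-time-step delay can be made concrete with a small data-generation sketch (illustrative, not from the slides): at each step the network sees one column of the two addends, and the target is the sum bit for the column shown two steps earlier.

import numpy as np

def binary_addition_example(a, b, T=8):
    """Build one training sequence: inputs are the bit columns of a and b
    (least significant bit first); the target at step t is the sum bit for
    the column shown at step t-2 (illustrative sketch)."""
    bits_a = [(a >> i) & 1 for i in range(T)]
    bits_b = [(b >> i) & 1 for i in range(T)]
    bits_sum = [((a + b) >> i) & 1 for i in range(T)]
    inputs = np.array(list(zip(bits_a, bits_b)))                         # shape (T, 2)
    targets = np.array([bits_sum[t - 2] if t >= 2 else 0 for t in range(T)])
    return inputs, targets

# e.g. 00110100 (52) + 01001101 (77) = 10000001 (129), as on the slide
inputs, targets = binary_addition_example(52, 77)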
The connectivity of the
network
• The input units have feedforward connections to 3 fully interconnected hidden units.
• These connections allow the inputs to vote for the next hidden activity pattern.
What the network learns
• It learns four distinct patterns of activity for the 3 hidden units.
• The patterns correspond to the nodes in the finite state automaton.
• Nodes in the FSA are like activity vectors.
• The automaton is restricted to be in exactly one state at each time; the hidden units are restricted to have exactly one vector of activity at each time.
Recall: Backpropagation Rule
• General weight updating rule:
  Δw_ji = η δ_j o_i
• where
  δ_j = (t_j − o_j) o_j (1 − o_j)                    for the output layer
  δ_j = (Σ_{k ∈ next layer} w_kj δ_k) o_j (1 − o_j)  for hidden layers
The problem of exploding or
vanishing gradients
– If the weights are small, the gradients shrink exponentially.
– If the weights are big, the gradients grow exponentially.
• Typical feedforward neural nets can cope with these exponential effects because they only have a few hidden layers.
• An RNN unrolled over a long sequence behaves like a very deep net, so its gradients can easily vanish or explode.
LSTM
(Ack: lecture notes of Taylor Arnold, Yale, and http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
LSTM: a variation of vanilla
RNN
Vanilla RNN
LSTM: complexity within the
block
Central idea
• The memory cell maintains its state over time.
• Non-linear gating units regulate the information flow into and out of the cell.
A simple line diagram for
LSTM
Stepping through Constituents
of LSTM
Again: Example of Refrigerator
complaint
• "Visiting service person is becoming rarer and rarer, ..."
  (ambiguous! 'visit to service person' OR 'visit by service person'?)
• "... and I am regretting/appreciating my decision to have bought the refrigerator from this company"
  (appreciating → 'to'; regretting → 'by')
Possibilities
• 'Visiting': 'visit to' or 'visit by' (ambiguity, syntactic opacity)
• Problem: solved or unsolved (not known, semantic opacity)
• 'Appreciating'/'Regretting': transparent; available on the surface
4 possibilities (states)
Clue-1                  | Clue-2       | Problem    | Sentiment
Visit to service person | Appreciating | Solved     | Positive
Visit to service person | Appreciating | Not solved | Not making sense! Incoherent
Visit to service person | Regretting   | Solved     | May be reverse sarcasm
Visit to service person | Regretting   | Not solved | Negative
4 possibilities (states)
Clue-1                  | Clue-2       | Problem    | Sentiment
Visit by service person | Appreciating | Solved     | Positive
Visit by service person | Appreciating | Not solved | May be sarcastic
Visit by service person | Regretting   | Solved     | May be reverse sarcasm
Visit by service person | Regretting   | Not solved | Negative
LSTM constituents: Cell State
The first and foremost component: the controller of the flow of information.
LSTM constituents: Forget Gate
Helps forget irrelevant information. It is a sigmoid function, so its output is between 0 and 1. Because of the elementwise product, an output close to 1 lets information pass fully, while an output close to 0 blocks it.
LSTM constituents: Input gate
tanh produces a candidate cell-state vector; it is multiplied with the input gate, which again outputs values between 0 and 1 and controls what and how much of the input goes FORWARD.
Cell state operation
Finally
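Putting the constituents together, one forward step of a standard LSTM cell can be sketched as below; this follows the usual LSTM equations, and the parameter names (W*, U*, b*) are illustrative assumptions rather than the notation of any particular figure.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step. p is a dict of parameters W*, U*, b* for each gate."""
    f = sigmoid(p['Wf'] @ x_t + p['Uf'] @ h_prev + p['bf'])   # forget gate: what to drop from the cell state
    i = sigmoid(p['Wi'] @ x_t + p['Ui'] @ h_prev + p['bi'])   # input gate: how much new information to admit
    g = np.tanh(p['Wg'] @ x_t + p['Ug'] @ h_prev + p['bg'])   # candidate cell-state vector
    o = sigmoid(p['Wo'] @ x_t + p['Uo'] @ h_prev + p['bo'])   # output gate
    c_t = f * c_prev + i * g                                  # cell state: gated old state + gated candidate
    h_t = o * np.tanh(c_t)                                    # hidden state exposed to the rest of the network
    return h_t, c_t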
Better picture (the one we
started with)
Another picture
LSTM schematic (Greff et al., "LSTM: A Search Space Odyssey", arXiv 2015)
Legend
Required mathematics
Training of LSTM
Many layers and gates
• Though complex, it is in principle possible to train.
• The gates are also sigmoid or tanh networks.
• Remember the FUNDAMENTAL backpropagation rule.
General Backpropagation Rule
• General weight updating rule:
  Δw_ji = η δ_j o_i
• where
  δ_j = (t_j − o_j) o_j (1 − o_j)                    for the output layer
  δ_j = (Σ_{k ∈ next layer} w_kj δ_k) o_j (1 − o_j)  for hidden layers
LSTM tools
• TensorFlow, Ocropus, RNNlib, etc.
• The tools do everything internally.
• Still, insight into the underlying concepts is indispensable.
LSTM applications
Many applications
• Language modeling: the TensorFlow tutorial on the PTB corpus (Recurrent Neural Networks) is a good place to start; character- and word-level LSTMs are used
• Machine translation, also known as sequence to sequence learning (https://arxiv.org/pdf/1409.3215.pdf)
• Image captioning, with and without attention (https://arxiv.org/pdf/1411.4555v...)
• Handwriting generation (http://arxiv.org/pdf/1308.0850v5...)
• Image generation using attention models - my favorite (https://arxiv.org/pdf/1502.04623...)
• Question answering (http://www.aclweb.org/anthology/...)
• Video to text (https://arxiv.org/pdf/1505.00487...)
Deep Learning Based Seq2Seq
Models and POS Tagging
Acknowledgement: Anoop Kunchukuttan, PhD Scholar, IIT Bombay
So far we have seen POS tagging as a sequence labelling task
For every element, predict the tag/label (using function f):
   I read the book  →  PRP VB DT NN
● The length of the output sequence is the same as that of the input sequence
● Prediction of the tag at time t can use only the words seen till time t
We can also look at POS tagging as a sequence to sequence transformation problem
Read the entire sequence and predict the output sequence (using function F):
   I read the book  →  PRP VB DT NN
● The length of the output sequence need not be the same as that of the input sequence
● Prediction at any time step t has access to the entire input
● A more general framework than sequence labelling
Sequence to sequence transformation is a more general framework than sequence labelling
● Many other problems can be expressed as sequence to sequence transformation
   ○ e.g. machine translation, summarization, question answering, dialog
● Adds more capabilities which can be useful for problems like MT:
   ○ many → many mappings: insertion/deletion of words, one-one mappings
   ○ non-monotone mappings: reordering of words
● For POS tagging, these capabilities are not required
How does a sequence to sequence model work? Let's see two paradigms.
Encode - Decode Paradigm
Use two RNN networks: the encoder and the decoder.
(1) The encoder processes one element of the input sequence at a time.
(2) A representation of the sentence is generated.
(3) This is used to initialize the decoder state.
(4) The decoder generates one element at a time.
(5) ... and continues until the end-of-sequence tag <EOS> is generated.
[Figure: encoder states h0..h4 read "I read the book" (encoding); decoder states s0..s4 emit PRP VB DT NN <EOS> (decoding).]
This approach reduces the entire sentence representation to a single vector.
Two problems with this design choice:
● It is not sufficient to capture all the syntactic and semantic complexities of a sentence.
   ○ Solution: use a richer representation for the sentence.
● The problem of capturing long term dependencies: the decoder RNN will not be able to make use of the source sentence representation after a few time steps.
   ○ Solution: make the source sentence information available when making the next prediction.
   ○ Even better, make the RELEVANT source sentence information available.
These solutions motivate the next paradigm.
Encode - Attend - Decode Paradigm
Represent the source sentence by the set of output vectors from the encoder.
Each output vector at time t is a contextual representation of the input at time t.
Let's call these encoder output vectors annotation vectors.
Note: in the encode-decode paradigm, we ignore the encoder outputs.
[Figure: encoder states s0..s4 over "I read the book" produce the annotation vectors.]
How should the decoder use the set of annotation vectors while predicting the next element?
Key Insight:
(1) Not all annotation vectors are equally important for prediction of the next element.
(2) The annotation vector to use next depends on what has been generated so far by the decoder.
e.g. to generate the 3rd POS tag, the 3rd annotation vector (hence the 3rd word) is most important.
One way to achieve this:
Take a weighted average of the annotation vectors, giving more weight to the annotation vectors which need more focus or attention. This averaged context vector is an input to the decoder.
For generation of the ith output element:
   c_i = Σ_j a_ij o_j, where
   c_i : context vector
   a_ij : annotation weight for the jth annotation vector
   o_j : jth annotation vector
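A minimal numpy sketch of this weighted average (the names and the softmax normalization of the scores are assumptions of the sketch):

import numpy as np

def context_vector(annotations, scores):
    """annotations: array (n, d) of annotation vectors o_j;
    scores: unnormalized relevance scores, one per annotation.
    Returns the attention weights a_ij and the context vector c_i."""
    e = np.exp(scores - scores.max())
    weights = e / e.sum()            # a_ij: sum to 1 over the annotations
    c = weights @ annotations        # c_i = sum_j a_ij * o_j
    return weights, c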
Let's see an example of how the attention mechanism works.
[Figure (slides 58-62): at each decoding step i, the decoder state h_i attends over the annotation vectors o_1..o_4 with weights a_i1..a_i4, forming the context vector c_i and emitting the next tag: PRP, VB, DT, NN, and finally <EOS>.]
But we do not know the attention weights. How do we find them?
Let the training data help you decide!
Idea: pick the attention weights that maximize the POS tagging accuracy (more precisely, decrease the training data loss).
Have an attention function that predicts the attention weights:
   a_ij = A(o_j, h_i; θ)
A could be implemented as a feedforward network which is a component of the overall network.
Then training the attention network with the rest of the network ensures that the attention weights are learnt to minimize the overall task loss (the translation loss in MT).
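A(o_j, h_i; θ) as a small feedforward scorer might look like the sketch below; an additive (Bahdanau-style) score is one common choice, and the parameter names Wa, Ua, v are illustrative assumptions.

import numpy as np

def attention_weights(annotations, h_i, Wa, Ua, v):
    """Score each annotation o_j against the decoder state h_i and normalize.
    Wa, Ua, v are parameters of the attention network, trained jointly."""
    scores = np.array([v @ np.tanh(Wa @ o_j + Ua @ h_i) for o_j in annotations])
    e = np.exp(scores - scores.max())
    return e / e.sum()               # a_ij for j = 1..n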
OK, but do the attention weights actually show focus on certain parts?
Here is an example of how attention weights represent a soft alignment for machine translation.
Let's go back to the encoder. What type of encoder cell should we use there?
● Basic RNN: models the sequence history by maintaining state information
   ○ But it cannot model long range dependencies
● LSTM: can model history and is better at handling long range dependencies
These RNN units model only the sequence seen so far; they cannot see the sequence ahead.
● We can use a bidirectional RNN/LSTM (see the sketch below)
● This is just 2 LSTM encoders run from opposite ends of the sequence, with the resulting output vectors composed
Both types of RNN units process the sequence sequentially, hence parallelism is limited.
Alternatively, we can use a CNN:
● It can operate on a sequence in parallel
● However, it cannot model the entire sequence history
● It models only a short local context; this may be sufficient for some applications, or deep CNN layers can overcome the problem
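A sketch of the bidirectional idea, assuming an lstm_step like the one given earlier; composing the two directions by concatenation per position is one common choice, not the only one.

import numpy as np

def bidirectional_encode(xs, h0, c0, params_fwd, params_bwd):
    """Run two LSTM encoders from opposite ends and concatenate their outputs.
    lstm_step: see the LSTM cell sketch earlier in these notes."""
    fwd, h, c = [], h0, c0
    for x in xs:                      # left-to-right pass
        h, c = lstm_step(x, h, c, params_fwd)
        fwd.append(h)
    bwd, h, c = [], h0, c0
    for x in reversed(xs):            # right-to-left pass
        h, c = lstm_step(x, h, c, params_bwd)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # annotation vectors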
Convolutional Neural Network
(CNN)
CNN = feedforward + recurrent!
• Whatever we learnt so far in FF-BP is useful for understanding CNNs
• So also is the case with RNN (and LSTM)
• The input is divided into regions and fed forward
• A window slides over the input: the input changes, but the 'filter' parameters remain the same
• That weight sharing across positions is the RNN-like aspect
Remember Neocognitron
Convolution
§ The matrix on the left represents a black and white image.
§ Each entry corresponds to one pixel: 0 for black and 1 for white (typically it is between 0 and 255 for grayscale images).
§ The sliding window is called a kernel, filter, or feature detector.
§ Here we use a 3×3 filter, multiply its values element-wise with the original matrix, then sum them up.
§ To get the full convolution we do this for each element by sliding the filter over the whole matrix.
[Figure: the resulting convolved feature map.]
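A plain numpy sketch of this sliding-window operation (valid convolution, no padding; the example image and filter are made up for illustration):

import numpy as np

def convolve2d(image, kernel):
    """Slide a k x k kernel over the image, multiply element-wise and sum."""
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + k, c:c + k] * kernel)
    return out

# e.g. a 3x3 all-ones filter over a random binary image
image = np.random.randint(0, 2, (5, 5))
print(convolve2d(image, np.ones((3, 3))))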
CNN architecture
• Several layers of convolution, with tanh or ReLU applied to the results
• In a traditional feedforward neural network we connect each input neuron to each output neuron in the next layer. That is also called a fully connected layer, or affine layer.
• In CNNs we instead use convolutions over the input layer to compute the output.
• This results in local connections, where each region of the input is connected to a neuron in the output.
Learning in CNN
• A CNN automatically learns the values of its filters.
• For example, in image classification it learns to
   • detect edges from raw pixels in the first layer,
   • then use the edges to detect simple shapes in the second layer,
   • and then use these shapes to detect higher-level features, such as facial shapes, in higher layers.
• The last layer is then a classifier that uses these high-level features.
Remember Neocognitron
What about NLP and CNN?
• Natural match!
• NLP happens in layers
NLP: multilayered, multidimensional
[Figure: the NLP Trinity - three dimensions: Problem (Morphology, POS tagging, Chunking, Parsing, Semantics, Discourse and Coreference), Language (Marathi, French, Hindi, English), and Algorithm (HMM, MEMM, CRF). The complexity of processing increases as one moves up the layers from morph analysis to semantics.]
NLP layers and CNN
• Morph layer →
• POS layer →
• Parse layer →
• Semantics layer
http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
Pooling
• Gives invariance to translation, rotation and scaling
• Important for image recognition
• Role in NLP?
Input matrix for CNN: NLP
§ The "image" for NLP ↔ word vectors in the rows
§ For a 10-word sentence using a 100-dimensional embedding, we would have a 10×100 matrix as our input
(Credit: Denny Britz)
CNN for NLP
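To make the picture concrete, here is a minimal sketch of a single convolution filter plus max-over-time pooling applied to such a 10×100 sentence matrix; it is illustrative only (not Denny Britz's code or the ACL 2017 model), and the random sentence and filter stand in for learnt embeddings and weights.

import numpy as np

def text_conv_maxpool(sent_matrix, filt, bias=0.0):
    """sent_matrix: (num_words, emb_dim); filt: (window, emb_dim).
    Convolve over word windows, apply ReLU, then max-pool over time."""
    n, d = sent_matrix.shape
    w = filt.shape[0]
    feats = np.array([np.sum(sent_matrix[i:i + w] * filt) + bias
                      for i in range(n - w + 1)])
    feats = np.maximum(0.0, feats)    # ReLU
    return feats.max()                # max-over-time pooling -> one feature per filter

sentence = np.random.randn(10, 100)   # 10 words, 100-dimensional embeddings
print(text_conv_maxpool(sentence, np.random.randn(3, 100)))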
CNN hyperparameters
• Narrow vs. wide convolution
• Stride size
• Pooling layers
• Channels
Abhijit Mishra, Kuntal Dey and Pushpak Bhattacharyya,
Learning Cognitive Features from Gaze Data for Sentiment and Sarcasm
Classification Using Convolutional Neural Network, ACL 2017, Vancouver, Canada,
July 30-August 4, 2017.
Learning Cognitive Features from Gaze
Data for Sentiment and Sarcasm
Classification
• In complex classification tasks like sentiment analysis and sarcasm detection, even the extraction and choice of features should be delegated to the learning system
• The CNN learns features from both gaze and text and uses them to classify the input text