Deep contextualized word representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
NAACL-HLT 2018
Overview
• Propose a new type of deep contextualised word representation (ELMo) that models:
‣ Complex characteristics of word use (e.g., syntax and semantics)
‣ How these uses vary across linguistic contexts (i.e., to model polysemy)
• Show that ELMo can improve existing neural models on various NLP tasks
• Argue that ELMo captures more abstract linguistic characteristics in its higher layers
Example
[Figure: nearest-neighbour comparison — GloVe mostly learns the sport-related context, while ELMo can distinguish the word sense based on the surrounding context]
Method
• Embeddings from Language Models: ELMo
• Learn word embeddings by building bidirectional language models (biLMs)
‣ biLMs consist of a forward and a backward LM (the joint training objective is sketched below)
✦ Forward: p(t_1, t_2, …, t_N) = ∏_{k=1}^{N} p(t_k | t_1, t_2, …, t_{k−1})
✦ Backward: p(t_1, t_2, …, t_N) = ∏_{k=1}^{N} p(t_k | t_{k+1}, t_{k+2}, …, t_N)
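As a recap of how the two factorisations are used in the paper, the forward and backward LMs are trained jointly by maximising the sum of their log-likelihoods, sharing the token-embedding and softmax parameters while keeping separate LSTM parameters per direction (a sketch in LaTeX notation):

```latex
% Joint biLM objective: forward + backward log-likelihoods.
% \Theta_x (token embedding) and \Theta_s (softmax) are shared across directions;
% the forward and backward LSTMs keep their own parameters.
\sum_{k=1}^{N} \Big(
    \log p(t_k \mid t_1, \ldots, t_{k-1};\ \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s)
  + \log p(t_k \mid t_{k+1}, \ldots, t_N;\ \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s)
\Big)
```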
Method
With a long short-term memory (LSTM) network, the biLMs are built by predicting the next word in both directions. A minimal code sketch of the forward direction follows the figure description.
[Figure: the forward LM architecture, expanded in the forward direction over positions k —
Output layer: o_k (predicts the next word, e.g. "a", "nice", "one" for the input "… have a nice one …")
Hidden layers (LSTMs): h^LM_{k,2}, h^LM_{k,1}
Embedding layer: x_k for the input token t_k]
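This is not the authors' implementation (the paper uses a character CNN over tokens and much larger dimensions); it is a minimal PyTorch sketch of the forward LM idea with assumed hyperparameters and word-level inputs:

```python
import torch
import torch.nn as nn

class ForwardLM(nn.Module):
    """Minimal forward language model: embed t_1..t_{k-1}, run a stacked LSTM,
    and predict the next token at every position (a sketch, not the paper's biLM)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # x_k
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers,      # h_{k,1}, h_{k,2}
                            batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)              # o_k

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        x = self.embed(tokens)
        h, _ = self.lstm(x)
        return self.out(h)                                        # next-token logits

# Training step: shift targets by one so position k predicts token k+1.
model = ForwardLM(vocab_size=10000)
tokens = torch.randint(0, 10000, (2, 6))                          # toy batch
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 10000), tokens[:, 1:].reshape(-1))
```

The backward LM is the same network run over the reversed sequence, predicting the previous token instead of the next one.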
Method
ELMo represents a word t_k as a linear combination of the corresponding hidden layers (including its embedding). ELMo is a task-specific representation: the downstream task learns the weighting parameters (a minimal code sketch follows).

ELMo_k^task = γ^task × Σ_j s_j^task × h_{k,j}^LM

where
h_{k,j}^LM = [→h_{k,j}^LM ; ←h_{k,j}^LM]  (concatenation of the forward and backward LM hidden layers, j = 1, 2)
h_{k,0}^LM = x_k = [x_k ; x_k]  (the embedding layer)

Unlike usual word embeddings, ELMo is assigned to every token instead of to a type.
[Figure: forward and backward LMs over t_k, with their hidden layers concatenated and weighted]
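A minimal PyTorch sketch of this weighting, assuming the biLM layer states are already computed; following the paper, the s_j^task weights are softmax-normalised and γ^task is a learned scalar (class name and shapes are illustrative):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighting of biLM layers (a sketch of the ELMo formula):
    ELMo_k^task = gamma^task * sum_j s_j^task * h_{k,j}^LM, with s = softmax(w)."""
    def __init__(self, num_layers=3):                     # layer 0 = embedding [x_k; x_k]
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))    # -> s_j^task after softmax
        self.gamma = nn.Parameter(torch.ones(1))          # gamma^task

    def forward(self, layer_states):
        # layer_states: (num_layers, batch, seq_len, 2 * lstm_dim),
        # each layer being the concatenation [forward h_{k,j}; backward h_{k,j}].
        s = torch.softmax(self.w, dim=0)
        mixed = (s.view(-1, 1, 1, 1) * layer_states).sum(dim=0)
        return self.gamma * mixed

# Toy usage: 3 biLM layers, 2 sentences, 6 tokens, 1024-dim concatenated states.
states = torch.randn(3, 2, 6, 1024)
elmo_vectors = ScalarMix()(states)                        # (2, 6, 1024)
```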
Method
ELMo can be integrated into almost any neural NLP model by simply concatenating it to the embedding layer: the pre-trained biLMs run over the corpus, the usual inputs are enhanced with ELMo vectors, and the task model is then trained as before (a concatenation sketch follows).
[Figure: the biLMs produce an ELMo vector for each token ("have", "a", "nice", …), which is concatenated with the usual inputs before training]
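The "enhance inputs with ELMo" step is just a feature-wise concatenation; the dimensions below are assumptions for illustration:

```python
import torch

# Hypothetical shapes: (batch, seq_len, 300) from the task model's usual embedding
# layer, and (batch, seq_len, 1024) ELMo vectors from the frozen biLM.
task_embeddings = torch.randn(2, 6, 300)
elmo_vectors = torch.randn(2, 6, 1024)

# Concatenate along the feature dimension and feed the result to the task
# model's encoder in place of the plain embeddings.
enhanced_inputs = torch.cat([task_embeddings, elmo_vectors], dim=-1)  # (2, 6, 1324)
```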
Evaluation
Many linguistic tasks are improved by using ELMo:
[Table: benchmark results for Q&A, textual entailment, semantic role labelling, coreference resolution, named entity recognition, and sentiment analysis]
Analysis
The higher layer seemed to learn semantics, while the lower layer probably captured syntactic features.
[Figure: per-layer results for word sense disambiguation and PoS tagging]
Analysis
Did the higher layer really learn semantics while the lower layer captured syntactic features? Most models preferred the "syntactic (probably)" features, even in sentiment analysis.
[Figure: learned layer weights across tasks]
Analysis
ELMo-enhanced models can make use of small datasets more efficiently.
[Figure: performance versus training-set size for textual entailment and semantic role labelling]
Comments
• Pre-trained ELMo models are available at https://allennlp.org/elmo (a usage sketch follows)
‣ AllenNLP is a deep NLP library built on top of PyTorch
‣ AllenNLP is a product of AI2 (Allen Institute for Artificial Intelligence), which also works on other interesting projects such as Semantic Scholar
• ELMo can process character-level inputs
‣ Japanese (Chinese, Korean, …) ELMo models are therefore likely to be possible
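For reference, this is roughly how the pre-trained models were loaded in older AllenNLP releases via the allennlp.modules.elmo module; the file names are hypothetical local copies of the options/weights downloadable from https://allennlp.org/elmo, and the exact API may differ in current versions, so check the AllenNLP documentation:

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Hypothetical local paths to the pre-trained files from https://allennlp.org/elmo.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

# Character-level inputs: sentences are given as lists of tokens,
# then converted to character ids for the biLM's character CNN.
sentences = [["have", "a", "nice", "one"]]
character_ids = batch_to_ids(sentences)

# One contextual vector per token, e.g. shape (1, 4, 1024) for this sentence.
embeddings = elmo(character_ids)["elmo_representations"][0]
```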