Pretraining Language Models
Wei Xu
(many slides from Greg Durrett)
Pretraining / ELMo
Recall: Context-dependent Embeddings
‣ How to handle different word senses? One vector for balls: “they dance at balls” vs. “they hit the balls”
‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors
Peters et al. (2018)
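As a rough illustration of the idea (not the actual ELMo setup), here is a minimal PyTorch sketch that runs a small LSTM language model over a sentence and reads off its hidden states as context-dependent word vectors; the vocabulary, dimensions, and model are all made up for illustration.

```python
import torch
import torch.nn as nn

# Toy vocabulary; indices and dimensions are arbitrary for illustration.
vocab = {"<unk>": 0, "they": 1, "dance": 2, "at": 3, "balls": 4, "hit": 5, "the": 6}

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)  # predicts the next word

    def forward(self, ids):
        hidden, _ = self.lstm(self.embed(ids))   # (batch, seq, dim)
        return self.out(hidden), hidden          # next-word logits + internal states

lm = TinyLM(len(vocab))
# After training with a next-word cross-entropy loss, the hidden state at each
# position serves as a context-dependent embedding of that word:
ids = torch.tensor([[vocab["they"], vocab["dance"], vocab["at"], vocab["balls"]]])
_, states = lm(ids)
balls_vector = states[0, -1]   # differs from the vector "balls" gets in "they hit the balls"
```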
ELMo
‣ CNN over each word => RNN next word prediction
‣ Representation of visited: the LSTM states at that position (plus vectors from the backwards LM)
‣ 4096-dim LSTMs w/ 512-dim projections
‣ 2048 CNN filters projected down to 512-dim
(Figure: a char CNN over each token of “John visited Madagascar yesterday” feeding the forward/backward LSTM LMs)
Peters et al. (2018)
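A schematic PyTorch sketch of the pieces named above (char CNN, projection to 512, stacked 4096-dim LSTMs with 512-dim projections); filter widths, the character vocabulary, and the single conv layer are simplifications, not details from the paper, and only one direction is shown.

```python
import torch
import torch.nn as nn

class ElmoLikeEncoder(nn.Module):
    """Schematic only: char CNN per word, projected to 512, then a 2-layer
    LSTM LM with 4096-dim cells and 512-dim projections (one direction)."""
    def __init__(self, n_chars=262, char_dim=16):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        # One conv standing in for ELMo's bank of filters (2048 total).
        self.char_cnn = nn.Conv1d(char_dim, 2048, kernel_size=3, padding=1)
        self.proj = nn.Linear(2048, 512)                      # 2048 -> 512
        self.lstm = nn.LSTM(512, 4096, num_layers=2,
                            proj_size=512, batch_first=True)  # 4096-dim cells, 512-dim outputs

    def forward(self, char_ids):                  # (batch, words, chars)
        b, w, c = char_ids.shape
        x = self.char_embed(char_ids.view(b * w, c)).transpose(1, 2)
        x = torch.relu(self.char_cnn(x)).max(dim=2).values    # max-pool over characters
        x = self.proj(x).view(b, w, 512)
        hidden, _ = self.lstm(x)                  # per-word contextual states
        return hidden

enc = ElmoLikeEncoder()
states = enc(torch.randint(0, 262, (1, 4, 10)))   # e.g. "John visited Madagascar yesterday"
```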
How to apply ELMo?
‣ Take those embeddings and feed them into whatever architecture you want to use for your task
‣ Frozen embeddings: update the weights of your network but keep ELMo’s parameters frozen
‣ Fine-tuning: backpropagate all the way into ELMo when training your model
(Figure: task predictions (sentiment, etc.) from some neural network stacked on ELMo embeddings of “they dance at balls”)
Peters, Ruder, Smith (2019)
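A minimal sketch of the two options in PyTorch, with a placeholder encoder and task network standing in for ELMo and the downstream model: freezing just turns off gradients for the pretrained parameters, while fine-tuning leaves them trainable.

```python
import torch

# Stand-ins for a pretrained encoder and a task network (not real ELMo weights).
elmo = torch.nn.LSTM(300, 512, batch_first=True)   # placeholder "pretrained" encoder
task_model = torch.nn.Linear(512, 2)               # e.g., a sentiment classifier

# Option 1: frozen embeddings -- only the task model's weights are updated.
for p in elmo.parameters():
    p.requires_grad = False
frozen_opt = torch.optim.Adam(task_model.parameters(), lr=1e-3)

# Option 2: fine-tuning -- backpropagate into the pretrained encoder as well.
for p in elmo.parameters():
    p.requires_grad = True
finetune_opt = torch.optim.Adam(
    list(elmo.parameters()) + list(task_model.parameters()), lr=1e-4)
```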
Results: Frozen ELMo
‣ Massive improvements across six benchmark tasks: question answering, natural language inference, semantic role labeling (discussed later in the course), coreference resolution, named entity recognition, and sentiment analysis
How to apply ELMo?
‣ How does frozen (❄) vs. fine-tuned (🔥) compare?
‣ Recommendations:
Peters, Ruder, Smith (2019)
Why is language modeling a good objective?
‣ “Impossible” problem, but bigger models seem to do better and better at distributional modeling (no upper limit yet)
‣ Successfully predicting next words requires modeling lots of different effects in text
‣ LAMBADA dataset (Paperno et al., 2016): explicitly targets world knowledge and very challenging LM examples
‣ Coreference, Winograd schema, and much more
Why is language modeling a good objective?
Zhang and Bowman (2018)
Why did this take time to catch on?
‣ An earlier version of ELMo by the same authors appeared in 2017, but it was only evaluated on tagging tasks and gains were 1% or less
‣ Required: training on lots of data, having the right architecture,
significant hyperparameter tuning
Probing ELMo
‣ From each layer of the ELMo model, attempt to predict something:
POS tags, word senses, etc.
‣ Higher accuracy => ELMo is capturing that thing more nicely
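One way to set up such a probe (a sketch, not the exact protocol from the probing papers): freeze ELMo, take the representation from a single chosen layer, and train only a linear classifier to predict POS tags from it.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: `layer_reps` stands in for frozen ELMo vectors from one layer,
# `pos_labels` for gold POS tag ids of the same tokens.
n_tokens, dim, n_tags = 1000, 1024, 45
layer_reps = torch.randn(n_tokens, dim)          # placeholder for real ELMo activations
pos_labels = torch.randint(0, n_tags, (n_tokens,))

probe = nn.Linear(dim, n_tags)                   # the only trainable parameters
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(probe(layer_reps), pos_labels)
    opt.zero_grad(); loss.backward(); opt.step()
# Higher probe accuracy for a layer => that layer encodes more POS information.
```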
BERT
BERT
‣ AI2 made ELMo in spring 2018, GPT was released in summer 2018, BERT
came out October 2018
‣ Three major changes compared to ELMo:
‣ Transformers instead of LSTMs (transformers in GPT as well)
‣ Bidirectional <=> masked LM objective instead of standard LM
‣ Fine-tune instead of freeze at test time
BERT
‣ ELMo is a unidirectional model (as is GPT): we can concatenate two unidirectional models, but is this the right thing to do?
‣ ELMo representations look at each direction in isolation; BERT looks at them jointly
(Figure: for “A stunning ballet dancer, Copeland is one of the best performers to see live.”, one ELMo direction sees only the “ballet dancer” context and the other only the “performer” context, while BERT represents “ballet dancer/performer” jointly)
Devlin et al. (2019)
BERT
‣ How to learn a “deeply bidirectional” model? What happens if we just replace an LSTM with a transformer?
(Figure: ELMo (language modeling) vs. BERT, each predicting “visited Madag. yesterday …” over the input “John visited Madagascar yesterday”)
‣ Transformer LMs have to be “one-sided” (only attend to previous tokens), not what we want
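The “one-sided” restriction comes from the causal attention mask used in transformer LMs; a small sketch of building such a mask in PyTorch (standard practice, not tied to any one paper):

```python
import torch

seq_len = 4  # e.g., "John visited Madagascar yesterday"
# True above the diagonal = positions each token is NOT allowed to attend to.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# This is the convention nn.TransformerEncoder / nn.MultiheadAttention expect
# (True = masked out), so "visited" can see "John" but not "Madagascar" or "yesterday".
```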
Masked Language Modeling
‣ How to prevent cheating? Next word prediction fundamentally doesn’t work for bidirectional models; instead do masked language modeling
‣ BERT formula: take a chunk of text, predict 15% of the tokens (here, Madagascar)
‣ For 80% (of the 15%), replace the input token with [MASK]: John visited [MASK] yesterday
‣ For 10%, replace w/ a random token: John visited of yesterday
‣ For 10%, keep the same: John visited Madagascar yesterday
Devlin et al. (2019)
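A sketch of the 80/10/10 corruption recipe described above, written against a generic list of token ids; the special-token id, vocabulary size, and example ids are placeholders.

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30522   # BERT-style values; placeholders for illustration

def mask_for_mlm(token_ids, mask_prob=0.15):
    """Return (corrupted input, labels); label is -100 (ignored) at unmasked positions."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:          # select ~15% of tokens to predict
            labels[i] = tok
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                        # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token
    return inputs, labels

# Illustrative ids standing in for "John visited Madagascar yesterday"
corrupted, targets = mask_for_mlm([2198, 3757, 16235, 7483])
```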
Next “Sentence” Prediction
‣ Input: [CLS] Text chunk 1 [SEP] Text chunk 2
‣ 50% of the time, take the true next chunk of text; 50% of the time, take a random other chunk. Predict whether the second chunk is the “true” next one
‣ BERT objective: masked LM + next sentence prediction
(Figure: a stack of transformer layers over “[CLS] John visited [MASK] yesterday and really all it [SEP] I like Madonna.”, predicting NotNext at [CLS] and Madagascar, enjoyed, like at the masked positions)
Devlin et al. (2019)
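A sketch of how the 50/50 training pairs might be built from a list of text chunks; the function and variable names are illustrative, not from the paper.

```python
import random

def make_nsp_example(chunks, i):
    """Build one NSP training pair starting from text chunk i."""
    first = chunks[i]
    if random.random() < 0.5 and i + 1 < len(chunks):
        second, label = chunks[i + 1], "IsNext"                    # true next chunk
    else:
        others = [c for j, c in enumerate(chunks) if j not in (i, i + 1)]
        second, label = random.choice(others), "NotNext"           # random other chunk
        # (the real setup samples the negative from a different document)
    return "[CLS] " + first + " [SEP] " + second, label

pair, label = make_nsp_example(
    ["John visited Madagascar yesterday.", "He really enjoyed it.", "I like Madonna."], 0)
```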
BERT Architecture
‣ BERT Base: 12 layers, 768-dim per wordpiece token, 12 heads. Total params = 110M
‣ BERT Large: 24 layers, 1024-dim per wordpiece token, 16 heads. Total params = 340M
‣ Positional embeddings and segment embeddings, 30k word pieces
‣ This is the model that gets
pre-trained on a large corpus
Devlin et al. (2019)
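With the Hugging Face transformers library (not part of the original release), the two configurations above can be instantiated from scratch and their parameter counts checked; a sketch:

```python
from transformers import BertConfig, BertModel

base = BertConfig(vocab_size=30522, hidden_size=768,
                  num_hidden_layers=12, num_attention_heads=12)
large = BertConfig(vocab_size=30522, hidden_size=1024,
                   num_hidden_layers=24, num_attention_heads=16,
                   intermediate_size=4096)

for name, cfg in [("Base", base), ("Large", large)]:
    model = BertModel(cfg)                                  # randomly initialized
    n_params = sum(p.numel() for p in model.parameters())
    print(name, f"{n_params / 1e6:.0f}M parameters")        # roughly 110M and 340M
```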
What can BERT do?
‣ The [CLS] token is used to provide classification decisions
‣ Sentence pair tasks (entailment): feed both sentences into BERT
‣ BERT can also do tagging by predicting tags at each word piece
Devlin et al. (2019)
What can BERT do?
(Figure: a stack of transformer layers over “[CLS] A boy plays in the snow [SEP] A boy is outside”, predicting Entails from the [CLS] token)
‣ How does BERT model this sentence pair stuff?
‣ Transformers can capture interactions between the two sentences, even though the NSP objective doesn’t really cause this to happen
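A sketch of the sentence-pair setup with the Hugging Face transformers library (not the authors’ original codebase); the classification head here is freshly initialized, so the prediction is meaningless until the model is fine-tuned on an entailment dataset.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# The tokenizer inserts [CLS] and [SEP] and sets segment (token_type) ids for the pair.
inputs = tokenizer("A boy plays in the snow", "A boy is outside", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # classification decision read off [CLS]
prediction = logits.argmax(dim=-1)       # e.g., entail / neutral / contradict after fine-tuning
```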
What can BERT NOT do?
‣ BERT cannot generate text (at least not in an obvious way)
‣ Not an autoregressive model; can do weird things like stick a [MASK] at the end of a string, fill in the mask, and repeat
‣ Masked language models are intended to be used primarily for “analysis” tasks
Lewis et al. (2019)
Fine-tuning BERT
‣ Fine-tune for 1-3 epochs, batch size 2-32, learning rate 2e-5 to 5e-5
‣ Large changes to weights near the top of the network (particularly in the last layer, to route the right information to [CLS])
‣ Smaller changes to weights lower down in the transformer
‣ Small LR and short fine-tuning schedule mean the weights don’t change much
‣ More complex “triangular learning rate” schemes exist
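A sketch of a typical fine-tuning setup matching the numbers above (AdamW, a small learning rate, a short warmup-then-decay schedule), using the transformers library; the exact choices vary by task and are illustrative here.

```python
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)    # lr in the 2e-5 to 5e-5 range

num_epochs, steps_per_epoch = 3, 1000                         # 1-3 epochs over the task data
total_steps = num_epochs * steps_per_epoch
scheduler = get_linear_schedule_with_warmup(                  # warmup, then linear decay
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps)

# Inside the training loop (per batch):
#   loss = model(**batch).loss
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```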
Fine-tuning BERT
‣ BERT is typically better if the whole network is fine-tuned, unlike ELMo
Peters, Ruder, Smith (2019)
Evaluation: GLUE
Wang et al. (2019)
Results
‣ Huge improvements over prior work (even compared to ELMo)
‣ Effective at “sentence pair” tasks: textual entailment (does sentence A imply sentence B?), paraphrase detection
Devlin et al. (2019)
RoBERTa
‣ “Robustly optimized BERT”
‣ 160GB of data instead of 16GB
‣ Dynamic masking: standard BERT uses the same MASK scheme for every epoch, RoBERTa recomputes the masks
‣ New training + more data = better performance
Liu et al. (2019)
GPT/GPT2/GPT3
OpenAI GPT/GPT2
‣ “ELMo with transformers” (works better than ELMo)
‣ Train a single unidirectional transformer LM on long contexts
‣ GPT2: trained on 40GB of text collected from upvoted links from reddit
‣ 1.5B parameters, by far the largest of these models trained as of March 2019
‣ Because it’s a language model, we can generate from it
Radford et al. (2019)
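Because GPT2 is an ordinary left-to-right LM, sampling continuations is straightforward; a sketch using the Hugging Face transformers port (not OpenAI’s original code), with illustrative decoding settings:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("John visited Madagascar", return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=40, do_sample=True,
                        top_k=40, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```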
OpenAI GPT2
(slide credit: OpenAI)
GPT3
https://twitter.com/cocoweixu/status/1285727605568811011
Pre-Training Cost (with Google/AWS)
‣ BERT: Base $500, Large $7000
‣ Grover-MEGA: $25,000
‣ XLNet (BERT variant): $30,000 — $60,000 (unclear)
‣ This is for a single pre-training run… developing new pre-training techniques may require many runs
‣ Fine-tuning these models can typically be done with a single GPU (but
may take 1-3 days for medium-sized datasets)
https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
Pre-training Cost
And a lot more …
Analysis
What does BERT learn?
‣ Heads on transformers learn interesting and diverse things: content heads (attend based on content), positional heads (attend based on position), etc.
Clark et al. (2019)
What does BERT learn?
‣ Still way worse than what supervised systems can do, but interesting that this is learned organically
Clark et al. (2019)
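A sketch of how head-level behavior can be inspected with the transformers library (a generic recipe, not the exact analysis code of Clark et al.): request attention weights and look at where a particular head puts its mass.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("John visited Madagascar yesterday .", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # tuple of 12 tensors: (batch, heads, seq, seq)

layer, head = 4, 3                            # arbitrary head to look at
weights = attentions[layer][0, head]          # who each token attends to in this head
# A "positional" head puts most mass one position to the left/right;
# a "content" head distributes mass based on what the tokens are.
```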
Probing BERT
‣ Try to predict POS, etc. from each layer. Learn mixing weights over BERT’s layers (ELMo-style scalar mix):
  h_i(τ) = γ_τ · Σ_ℓ s_τ^(ℓ) · h_i^(ℓ)
  where h_i^(ℓ) is the representation of wordpiece i at layer ℓ and the s_τ^(ℓ) are softmax-normalized weights learned for task τ
‣ Plot shows the s weights (blue) and performance deltas when an additional layer is incorporated (purple)
‣ BERT “rediscovers the classical NLP pipeline”: first syntactic tasks, then semantic ones
Tenney et al. (2019)
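A sketch of the learned-mixing probe: softmax-normalized layer weights s and a scalar γ are trained together with a small linear probe while BERT itself stays frozen. Shapes and the POS task are placeholders.

```python
import torch
import torch.nn as nn

class ScalarMixProbe(nn.Module):
    """Weighted combination of frozen layer representations + a linear probe."""
    def __init__(self, n_layers=13, dim=768, n_tags=45):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(n_layers))   # mixing weights s_tau (softmaxed)
        self.gamma = nn.Parameter(torch.ones(1))       # overall scale gamma_tau
        self.probe = nn.Linear(dim, n_tags)

    def forward(self, layer_reps):                     # (n_layers, tokens, dim), frozen
        w = torch.softmax(self.s, dim=0)
        mixed = self.gamma * (w[:, None, None] * layer_reps).sum(dim=0)
        return self.probe(mixed)                       # predict POS (or other) tags

# layer_reps would come from BERT's hidden states (output_hidden_states=True).
probe = ScalarMixProbe()
logits = probe(torch.randn(13, 6, 768))                # 12 layers + embeddings, 6 wordpieces
```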
Compressing BERT
‣ Remove 60+% of BERT’s heads with minimal drop in performance
‣ DistilBERT (Sanh et al., 2019): nearly as good with half the parameters of BERT (via knowledge distillation)
Michel et al. (2019)
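Head pruning of the kind Michel et al. study can be tried directly with the transformers library’s prune_heads utility; which heads to remove would normally be chosen by an importance score, not hard-coded as in this sketch.

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical choice: drop 6 of the 12 heads in a few layers; a real run would
# rank heads by importance (e.g., gradient-based scores) before pruning.
heads_to_prune = {0: [0, 1, 2, 3, 4, 5], 1: [2, 3, 4, 5, 6, 7], 11: [0, 2, 4, 6, 8, 10]}
model.prune_heads(heads_to_prune)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters after pruning")
```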
Open Questions
‣ BERT-based systems are state-of-the-art for nearly every major text
analysis task
‣ These techniques are here to stay, unclear what form will win out
‣ Role of academia vs. industry: no major pretrained model has come
purely from academia
‣ Cost/carbon footprint: a single model costs $10,000+ to train (though
this cost should come down)