Pretraining Language Models
Wei Xu
(many slides from Greg Durrett)
Pretraining / ELMo
Recall: Context-dependent Embeddings
‣ How to handle different word senses? One vector for balls: “they dance at balls” vs. “they hit the balls”
‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors
Peters et al. (2018)
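As a rough illustration of the idea (not the actual ELMo setup), here is a minimal PyTorch sketch that runs a small LSTM language model over a sentence and reads off its hidden states as context-dependent word vectors; the vocabulary, dimensions, and model are all made up for illustration.

```python
import torch
import torch.nn as nn

# Toy vocabulary; indices and dimensions are arbitrary for illustration.
vocab = {"<unk>": 0, "they": 1, "dance": 2, "at": 3, "balls": 4, "hit": 5, "the": 6}

class TinyLM(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)  # predicts the next word

    def forward(self, ids):
        hidden, _ = self.lstm(self.embed(ids))   # (batch, seq, dim)
        return self.out(hidden), hidden          # next-word logits + internal states

lm = TinyLM(len(vocab))
# After training with a next-word cross-entropy loss, the hidden state at each
# position serves as a context-dependent embedding of that word:
ids = torch.tensor([[vocab["they"], vocab["dance"], vocab["at"], vocab["balls"]]])
_, states = lm(ids)
balls_vector = states[0, -1]   # differs from the vector "balls" gets in "they hit the balls"
```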
ELMo
‣ CNN over each word => RNN next word prediction
‣ Representation of visited: the LSTM states at that position (plus vectors from the backwards LM)
‣ 4096-dim LSTMs w/ 512-dim projections
‣ 2048 CNN filters projected down to 512-dim
(Figure: a char CNN over each token of “John visited Madagascar yesterday” feeding the forward/backward LSTM LMs)
Peters et al. (2018)
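A schematic PyTorch sketch of the pieces named above (char CNN, projection to 512, stacked 4096-dim LSTMs with 512-dim projections); filter widths, the character vocabulary, and the single conv layer are simplifications, not details from the paper, and only one direction is shown.

```python
import torch
import torch.nn as nn

class ElmoLikeEncoder(nn.Module):
    """Schematic only: char CNN per word, projected to 512, then a 2-layer
    LSTM LM with 4096-dim cells and 512-dim projections (one direction)."""
    def __init__(self, n_chars=262, char_dim=16):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        # One conv standing in for ELMo's bank of filters (2048 total).
        self.char_cnn = nn.Conv1d(char_dim, 2048, kernel_size=3, padding=1)
        self.proj = nn.Linear(2048, 512)                      # 2048 -> 512
        self.lstm = nn.LSTM(512, 4096, num_layers=2,
                            proj_size=512, batch_first=True)  # 4096-dim cells, 512-dim outputs

    def forward(self, char_ids):                  # (batch, words, chars)
        b, w, c = char_ids.shape
        x = self.char_embed(char_ids.view(b * w, c)).transpose(1, 2)
        x = torch.relu(self.char_cnn(x)).max(dim=2).values    # max-pool over characters
        x = self.proj(x).view(b, w, 512)
        hidden, _ = self.lstm(x)                  # per-word contextual states
        return hidden

enc = ElmoLikeEncoder()
states = enc(torch.randint(0, 262, (1, 4, 10)))   # e.g. "John visited Madagascar yesterday"
```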
How to apply ELMo?
‣ Take those embeddings and feed them into whatever architecture you want to use for your task
‣ Frozen embeddings: update the weights of your network but keep ELMo’s parameters frozen
‣ Fine-tuning: backpropagate all the way into ELMo when training your model
(Figure: task predictions (sentiment, etc.) from some neural network stacked on ELMo embeddings of “they dance at balls”)
Peters, Ruder, Smith (2019)
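A minimal sketch of the two options in PyTorch, with a placeholder encoder and task network standing in for ELMo and the downstream model: freezing just turns off gradients for the pretrained parameters, while fine-tuning leaves them trainable.

```python
import torch

# Stand-ins for a pretrained encoder and a task network (not real ELMo weights).
elmo = torch.nn.LSTM(300, 512, batch_first=True)   # placeholder "pretrained" encoder
task_model = torch.nn.Linear(512, 2)               # e.g., a sentiment classifier

# Option 1: frozen embeddings -- only the task model's weights are updated.
for p in elmo.parameters():
    p.requires_grad = False
frozen_opt = torch.optim.Adam(task_model.parameters(), lr=1e-3)

# Option 2: fine-tuning -- backpropagate into the pretrained encoder as well.
for p in elmo.parameters():
    p.requires_grad = True
finetune_opt = torch.optim.Adam(
    list(elmo.parameters()) + list(task_model.parameters()), lr=1e-4)
```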
Results: Frozen ELMo
‣ Massive improvements across six benchmark tasks: question answering, natural language inference, semantic role labeling (discussed later in the course), coreference resolution, named entity recognition, and sentiment analysis
How to apply ELMo?
‣ How does frozen (❄) vs. fine-tuned (🔥) compare?
‣ Recommendations:
Peters, Ruder, Smith (2019)
Why is language modeling a good objective?
‣ “Impossible” problem, but bigger models seem to do better and better at distributional modeling (no upper limit yet)
‣ Successfully predicting next words requires modeling lots of different effects in text
‣ LAMBADA dataset (Paperno et al., 2016): explicitly targets world knowledge and very challenging LM examples
‣ Coreference, Winograd schema, and much more
Why is language modeling a good objective?
Zhang and Bowman (2018)
Why did this take time to catch on?
‣ An earlier version of ELMo by the same authors appeared in 2017, but it was only evaluated on tagging tasks and gains were 1% or less
‣ Required: training on lots of data, having the right architecture,
significant hyperparameter tuning
Probing ELMo
‣ From each layer of the ELMo model, attempt to predict something:
POS tags, word senses, etc.
‣ Higher accuracy => ELMo is capturing that thing more nicely
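One way to set up such a probe (a sketch, not the exact protocol from the probing papers): freeze ELMo, take the representation from a single chosen layer, and train only a linear classifier to predict POS tags from it.

```python
import torch
import torch.nn as nn

# Hypothetical shapes: `layer_reps` stands in for frozen ELMo vectors from one layer,
# `pos_labels` for gold POS tag ids of the same tokens.
n_tokens, dim, n_tags = 1000, 1024, 45
layer_reps = torch.randn(n_tokens, dim)          # placeholder for real ELMo activations
pos_labels = torch.randint(0, n_tags, (n_tokens,))

probe = nn.Linear(dim, n_tags)                   # the only trainable parameters
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(probe(layer_reps), pos_labels)
    opt.zero_grad(); loss.backward(); opt.step()
# Higher probe accuracy for a layer => that layer encodes more POS information.
```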
BERT
BERT
‣ AI2 made ELMo in spring 2018, GPT was released in summer 2018, BERT
came out October 2018
‣ Three major changes compared to ELMo:
‣ Transformers instead of LSTMs (transformers in GPT as well)
‣ Bidirectional <=> masked LM objective instead of standard LM
‣ Fine-tune instead of freeze at test time
BERT
‣ ELMo is a unidirectional model (as is GPT): we can concatenate two unidirectional models, but is this the right thing to do?
‣ ELMo representations look at each direction in isolation; BERT looks at them jointly
(Figure: for “A stunning ballet dancer, Copeland is one of the best performers to see live.”, one ELMo direction sees only the “ballet dancer” context and the other only the “performer” context, while BERT represents “ballet dancer/performer” jointly)
Devlin et al. (2019)
BERT
‣ How to learn a “deeply bidirectional” model? What happens if we just replace an LSTM with a transformer?
(Figure: ELMo (language modeling) vs. BERT, each predicting “visited Madag. yesterday …” over the input “John visited Madagascar yesterday”)
‣ Transformer LMs have to be “one-sided” (only attend to previous tokens), not what we want
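The “one-sided” restriction comes from the causal attention mask used in transformer LMs; a small sketch of building such a mask in PyTorch (standard practice, not tied to any one paper):

```python
import torch

seq_len = 4  # e.g., "John visited Madagascar yesterday"
# True above the diagonal = positions each token is NOT allowed to attend to.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask)
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
# This is the convention nn.TransformerEncoder / nn.MultiheadAttention expect
# (True = masked out), so "visited" can see "John" but not "Madagascar" or "yesterday".
```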
Masked Language Modeling
‣ How to prevent cheating? Next word prediction fundamentally doesn’t work for bidirectional models; instead do masked language modeling
‣ BERT formula: take a chunk of text, predict 15% of the tokens (here, Madagascar)
‣ For 80% (of the 15%), replace the input token with [MASK]: John visited [MASK] yesterday
‣ For 10%, replace w/ a random token: John visited of yesterday
‣ For 10%, keep the same: John visited Madagascar yesterday
Devlin et al. (2019)
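A sketch of the 80/10/10 corruption recipe described above, written against a generic list of token ids; the special-token id, vocabulary size, and example ids are placeholders.

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30522   # BERT-style values; placeholders for illustration

def mask_for_mlm(token_ids, mask_prob=0.15):
    """Return (corrupted input, labels); label is -100 (ignored) at unmasked positions."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:          # select ~15% of tokens to predict
            labels[i] = tok
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                        # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token
    return inputs, labels

# Illustrative ids standing in for "John visited Madagascar yesterday"
corrupted, targets = mask_for_mlm([2198, 3757, 16235, 7483])
```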
Next “Sentence” Prediction
‣ Input: [CLS] Text chunk 1 [SEP] Text chunk 2
‣ 50% of the time, take the true next chunk of text; 50% of the time, take a random other chunk. Predict whether the second chunk is the “true” next one
‣ BERT objective: masked LM + next sentence prediction
(Figure: a stack of transformer layers over “[CLS] John visited [MASK] yesterday and really all it [SEP] I like Madonna.”, predicting NotNext at [CLS] and Madagascar, enjoyed, like at the masked positions)
Devlin et al. (2019)
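A sketch of how the 50/50 training pairs might be built from a list of text chunks; the function and variable names are illustrative, not from the paper.

```python
import random

def make_nsp_example(chunks, i):
    """Build one NSP training pair starting from text chunk i."""
    first = chunks[i]
    if random.random() < 0.5 and i + 1 < len(chunks):
        second, label = chunks[i + 1], "IsNext"                    # true next chunk
    else:
        others = [c for j, c in enumerate(chunks) if j not in (i, i + 1)]
        second, label = random.choice(others), "NotNext"           # random other chunk
        # (the real setup samples the negative from a different document)
    return "[CLS] " + first + " [SEP] " + second, label

pair, label = make_nsp_example(
    ["John visited Madagascar yesterday.", "He really enjoyed it.", "I like Madonna."], 0)
```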
BERT Architecture
‣ BERT Base: 12 layers, 768-dim per wordpiece token, 12 heads. Total params = 110M
‣ BERT Large: 24 layers, 1024-dim per wordpiece token, 16 heads. Total params = 340M
‣ Positional embeddings and segment embeddings, 30k word pieces
‣ This is the model that gets
pre-trained on a large corpus
Devlin et al. (2019)
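With the Hugging Face transformers library (not part of the original release), the two configurations above can be instantiated from scratch and their parameter counts checked; a sketch:

```python
from transformers import BertConfig, BertModel

base = BertConfig(vocab_size=30522, hidden_size=768,
                  num_hidden_layers=12, num_attention_heads=12)
large = BertConfig(vocab_size=30522, hidden_size=1024,
                   num_hidden_layers=24, num_attention_heads=16,
                   intermediate_size=4096)

for name, cfg in [("Base", base), ("Large", large)]:
    model = BertModel(cfg)                                  # randomly initialized
    n_params = sum(p.numel() for p in model.parameters())
    print(name, f"{n_params / 1e6:.0f}M parameters")        # roughly 110M and 340M
```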
What can BERT do?
‣ The [CLS] token is used to provide classification decisions
‣ Sentence pair tasks (entailment): feed both sentences into BERT
‣ BERT can also do tagging by predicting tags at each word piece
Devlin et al. (2019)
What can BERT do?
(Figure: a stack of transformer layers over “[CLS] A boy plays in the snow [SEP] A boy is outside”, predicting Entails from the [CLS] token)
‣ How does BERT model this sentence pair stuff?
‣ Transformers can capture interactions between the two sentences, even though the NSP objective doesn’t really cause this to happen
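A sketch of the sentence-pair setup with the Hugging Face transformers library (not the authors’ original codebase); the classification head here is freshly initialized, so the prediction is meaningless until the model is fine-tuned on an entailment dataset.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# The tokenizer inserts [CLS] and [SEP] and sets segment (token_type) ids for the pair.
inputs = tokenizer("A boy plays in the snow", "A boy is outside", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # classification decision read off [CLS]
prediction = logits.argmax(dim=-1)       # e.g., entail / neutral / contradict after fine-tuning
```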
What can BERT NOT do?
‣ BERT cannot generate text (at least not in an obvious way)
‣ Not an autoregressive model; can do weird things like stick a [MASK] at the end of a string, fill in the mask, and repeat
‣ Masked language models are intended to be used primarily for “analysis” tasks
Lewis et al. (2019)
Fine-tuning BERT
‣ Fine-tune for 1-3 epochs, batch size 2-32, learning rate 2e-5 to 5e-5
‣ Large changes to weights near the top of the network (particularly in the last layer, to route the right information to [CLS])
‣ Smaller changes to weights lower down in the transformer
‣ Small LR and short fine-tuning schedule mean the weights don’t change much
‣ More complex “triangular learning rate” schemes exist
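A sketch of a typical fine-tuning setup matching the numbers above (AdamW, a small learning rate, a short warmup-then-decay schedule), using the transformers library; the exact choices vary by task and are illustrative here.

```python
import torch
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)    # lr in the 2e-5 to 5e-5 range

num_epochs, steps_per_epoch = 3, 1000                         # 1-3 epochs over the task data
total_steps = num_epochs * steps_per_epoch
scheduler = get_linear_schedule_with_warmup(                  # warmup, then linear decay
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps)

# Inside the training loop (per batch):
#   loss = model(**batch).loss
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```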
Fine-tuning BERT
‣ BERT is typically better if the whole network is fine-tuned, unlike ELMo
Peters, Ruder, Smith (2019)
Evaluation: GLUE
Wang et al. (2019)
Results
‣ Huge improvements over prior work (even compared to ELMo)
‣ Effective at “sentence pair” tasks: textual entailment (does sentence A imply sentence B?), paraphrase detection
Devlin et al. (2019)
RoBERTa
‣ “Robustly optimized BERT”
‣ 160GB of data instead of 16GB
‣ Dynamic masking: standard BERT uses the same MASK scheme for every epoch, RoBERTa recomputes the masks
‣ New training + more data = better performance
Liu et al. (2019)
GPT/GPT2/GPT3
OpenAI GPT/GPT2
‣ “ELMo with transformers” (works better than ELMo)
‣ Train a single unidirectional transformer LM on long contexts
‣ GPT2: trained on 40GB of text collected from upvoted links from reddit
‣ 1.5B parameters, by far the largest of these models trained as of March 2019
‣ Because it’s a language model, we can generate from it
Radford et al. (2019)
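Because GPT2 is an ordinary left-to-right LM, sampling continuations is straightforward; a sketch using the Hugging Face transformers port (not OpenAI’s original code), with illustrative decoding settings:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("John visited Madagascar", return_tensors="pt").input_ids
output = model.generate(input_ids, max_length=40, do_sample=True,
                        top_k=40, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```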
OpenAI GPT2
(slide credit: OpenAI)
GPT3
https://twitter.com/cocoweixu/status/1285727605568811011
Pre-Training Cost (with Google/AWS)
‣ BERT: Base $500, Large $7000
‣ Grover-MEGA: $25,000
‣ XLNet (BERT variant): $30,000 — $60,000 (unclear)
‣ This is for a single pre-training run… developing new pre-training techniques may require many runs
‣ Fine-tuning these models can typically be done with a single GPU (but
may take 1-3 days for medium-sized datasets)
https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
Pre-training Cost
And a lot more …
Analysis
What does BERT learn?
‣ Heads on transformers learn interesting and diverse things: content heads (attend based on content), positional heads (attend based on position), etc.
Clark et al. (2019)
What does BERT learn?
‣ Still way worse than what supervised systems can do, but interesting that this is learned organically
Clark et al. (2019)
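A sketch of how head-level behavior can be inspected with the transformers library (a generic recipe, not the exact analysis code of Clark et al.): request attention weights and look at where a particular head puts its mass.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("John visited Madagascar yesterday .", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # tuple of 12 tensors: (batch, heads, seq, seq)

layer, head = 4, 3                            # arbitrary head to look at
weights = attentions[layer][0, head]          # who each token attends to in this head
# A "positional" head puts most mass one position to the left/right;
# a "content" head distributes mass based on what the tokens are.
```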
Probing BERT
‣ Try to predict POS, etc. from each layer. Learn mixing weights over BERT’s layers (ELMo-style scalar mix):
  h_i(τ) = γ_τ · Σ_ℓ s_τ^(ℓ) · h_i^(ℓ)
  where h_i^(ℓ) is the representation of wordpiece i at layer ℓ and the s_τ^(ℓ) are softmax-normalized weights learned for task τ
‣ Plot shows the s weights (blue) and performance deltas when an additional layer is incorporated (purple)
‣ BERT “rediscovers the classical NLP pipeline”: first syntactic tasks, then semantic ones
Tenney et al. (2019)
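A sketch of the learned-mixing probe: softmax-normalized layer weights s and a scalar γ are trained together with a small linear probe while BERT itself stays frozen. Shapes and the POS task are placeholders.

```python
import torch
import torch.nn as nn

class ScalarMixProbe(nn.Module):
    """Weighted combination of frozen layer representations + a linear probe."""
    def __init__(self, n_layers=13, dim=768, n_tags=45):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(n_layers))   # mixing weights s_tau (softmaxed)
        self.gamma = nn.Parameter(torch.ones(1))       # overall scale gamma_tau
        self.probe = nn.Linear(dim, n_tags)

    def forward(self, layer_reps):                     # (n_layers, tokens, dim), frozen
        w = torch.softmax(self.s, dim=0)
        mixed = self.gamma * (w[:, None, None] * layer_reps).sum(dim=0)
        return self.probe(mixed)                       # predict POS (or other) tags

# layer_reps would come from BERT's hidden states (output_hidden_states=True).
probe = ScalarMixProbe()
logits = probe(torch.randn(13, 6, 768))                # 12 layers + embeddings, 6 wordpieces
```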
Compressing BERT
‣ Remove 60+% of BERT’s heads with minimal drop in performance
‣ DistilBERT (Sanh et al., 2019): nearly as good with half the parameters of BERT (via knowledge distillation)
Michel et al. (2019)
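Head pruning of the kind Michel et al. study can be tried directly with the transformers library’s prune_heads utility; which heads to remove would normally be chosen by an importance score, not hard-coded as in this sketch.

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical choice: drop 6 of the 12 heads in a few layers; a real run would
# rank heads by importance (e.g., gradient-based scores) before pruning.
heads_to_prune = {0: [0, 1, 2, 3, 4, 5], 1: [2, 3, 4, 5, 6, 7], 11: [0, 2, 4, 6, 8, 10]}
model.prune_heads(heads_to_prune)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters after pruning")
```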
Open Questions
‣ BERT-based systems are state-of-the-art for nearly every major text
analysis task
‣ These techniques are here to stay, unclear what form will win out
‣ Role of academia vs. industry: no major pretrained model has come
purely from academia
‣ Cost/carbon footprint: a single model costs $10,000+ to train (though
this cost should come down)