Introduction to Large Language Models
Kun Yuan (袁 坤)
Feb 20, 2024
Contents
• Large language model (LLM)
• How to effectively train LLM
• How to effectively use LLM
• Course plans
Note: The main content of this lecture is summarized from two wonderful talks [1, 2] by Andrej Karpathy
[1] State of GPT
[2] The busy person’s intro to LLMs
Teaching assistants
白禹东 耿云腾 何雨桐 李佩津 刘梓豪
鲁可儿 宋奕龙 孙乾祐 王宇驰
PART 01
Large language model (LLM)
Large language model
• Meta Llama 2 is probably the most powerful open-source LLM
• Weights, architectures, and the paper were all released by Meta
• Neural network parameters + the code to run them; that’s all you need
• No Internet access needed; just one laptop
What are the model parameters?
• An LLM can be regarded as a magic function that maps the context to the next word
• The model parameters parameterize this magic function as a series of matrix-matrix (and matrix-vector) products
f(“cat sat on a”; θ) = “mat”
• Given the model parameters θ, the LLM can predict the next word
LLM can generate texts of various styles
(figure: text generated in the styles of code, books, information pages, and Wikipedia)
How to get the weights? Training the deep neural network
• Use tremendous data and computing resources to get the valuable model parameters
• Very, very expensive; the model weights are updated perhaps once a year or once every few years
How to make an LLM your personal copilot? Prompt engineering and finetuning
• Over 90% of my interactions with ChatGPT are
• But we should use LLMs more frequently and smartly; they can be your personal copilot
• It is not easy to have your own LLM copilot: you need to know prompt engineering and finetuning
PART 02
ChatGPT Training Pipeline
The ChatGPT training pipeline has 4 stages
Source: Andrej Karpathy, State of GPT
Pretraining
99% of training time and resources
Source: Andrej Karpathy, State of GPT
Pretraining
Data collection
Data crawled from websites, of both high and low quality
High-quality data
Training data mixture used in the LLaMA model
Pretraining
Tokenization (word segmentation)
Transform long texts to lists of integers
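For example, with OpenAI's open-source tiktoken tokenizer (one concrete choice among many):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the encoding used by GPT-3.5/GPT-4
ids = enc.encode("The cat sat on the mat.")  # text -> list of integers
print(ids)
print(enc.decode(ids))                       # integers -> the original text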
Pretraining
Token and vocabulary
Sentence: "The cat sat on the mat. The cat is orange."
Token: ["The", "cat", "sat", "on", "the", "mat", ".", "The", "cat", "is", "orange", "."]
Vocabulary : {"The", "cat", "sat", "on", "the", "mat", ".", "is", "orange"}
The vocabulary is a set: each element is unique
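In Python, the distinction is simply a list versus a set:

tokens = ["The", "cat", "sat", "on", "the", "mat", ".",
          "The", "cat", "is", "orange", "."]
vocabulary = set(tokens)             # duplicates collapse into unique entries
print(len(tokens), len(vocabulary))  # 12 tokens, 9 vocabulary entries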
Pretraining
While GPT-3 is larger, LLaMA is trained on more tokens. In practice, LLaMA performs significantly better.
We cannot judge the power of an LLM by its parameter count alone; data also matters
It is still under debate whether one should increase model size or data size given a limited resource budget
Pretraining
• Effective representation learning
• Long-range dependency with attention
• Parallelizable architecture
• Flexibility and adaptability
(the recently popular Sora uses diffusion + Transformer)
Transformer architecture
(will discuss it in later lectures)
Pretraining
[Training Compute-Optimal Large Language Models]
Larger dataset + bigger model + longer training = better prediction accuracy
A very straightforward way to achieve a good LLM.
All you need is MONEY!
Amazing representation power
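For reference, the compute-optimal scaling study cited above fits the pretraining loss L as a function of model size N and number of training tokens D:

L(N, D) = E + A / N^α + B / D^β

where E is the irreducible loss and A, B, α, β are fitted constants; the two fitted exponents are roughly equal, suggesting N and D should be scaled together rather than pouring all compute into model size.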
Pretraining
Larger dataset + bigger model + longer training = better prediction accuracy
Pretraining
• Pretraining a base model is extremely expensive
• Several effective pretraining techniques:
§ 3D parallelism: data/model/tensor parallelism
§ Memory-efficient optimizers
§ Large-batch training
§ Mixed-precision training
• Will discuss them in later lectures
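As a small preview, here is a minimal sketch of mixed-precision training using PyTorch's built-in AMP utilities (model, optimizer, loss_fn, and loader are placeholders assumed to be defined elsewhere):

import torch

scaler = torch.cuda.amp.GradScaler()     # scales the loss to avoid fp16 underflow
for inputs, targets in loader:           # `loader`, `model`, etc. are placeholders
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass runs in half precision
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()        # backward pass on the scaled loss
    scaler.step(optimizer)               # unscales gradients, then takes a step
    scaler.update()                      # adjusts the scale factor for the next step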
Pretrained model provides strong transfer learning capabilities
Pretrained base model performs well after finetuning
Pretrained model provides strong transfer learning capabilities
• Pretraining + finetuning/prompting reshapes the AI industry.
• A pretrained base model only needs a small amount of data to be adapted to downstream applications.
• The cost to deploy AI in downstream applications decreases significantly:
§ Obtain powerful base models from OpenAI/Google/Meta/GitHub
§ Collect a small amount of downstream data and use it to finetune the base model
§ No need for expensive investment in money and talent
Pretraining
Base models in the wild
Pretraining
LLaMA and BLOOM are popular open-source base models
• LLaMA https://github.com/facebookresearch/llama
• Bloom https://huggingface.co/bigscience
Supervised Finetuning
Supervised Finetuning
Base models cannot be deployed directly; they are still far away from being a smart assistant
Supervised Finetuning
Base models can be tricked into being AI assistants with prompting
We need to finetune the base model to make it chat like humans
Supervised Finetuning
Ask human contractors to respond to prompts and generate high-quality, helpful, truthful, and harmless responses
Collect 10,000+ high-quality human-generated responses
Finetune base models with these high-quality data
Supervised Finetuning
• Dataset: 10~100K human-generated data pairs {(prompt, response)}
• Training: repeat what we did in the “Pretraining” stage
• After supervised finetuning stage, base models can chat like humans
• 1-100 GPUs; days of training; but it can still be very expensive due to the human-generated data
• To save money, some (or most) models use ChatGPT-generated data to finetune
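Concretely, "repeat what we did in pretraining" means minimizing the same next-token cross-entropy, now on the human-written responses:

L_SFT(θ) = − Σ_t log p_θ(y_t | prompt, y_1, …, y_{t−1})

where (prompt, y) is one human-generated pair and y_t is the t-th token of the response y.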
Reward modeling
Reward modeling
• The SFT model performs like an “assistant”, but is still not good enough.
• To further improve it, one can ask human contractors to generate more data; effective but expensive
• Another way is to let the model learn which responses are good, and how to generate good responses
• The reward model enables GPT to judge whether a given response is good or not
• The reward model will be used in the reinforcement learning stage to reinforce good responses
Reward modeling
Dataset
The SFT model generates different responses to the same prompt
Ask contractors to rank the responses; much cheaper
Dataset: {(prompt, response, reward)}
Reward modeling
• Given a prompt, the SFT model generates several responses and then makes a reward prediction (green).
• This predicted reward is supervised by the ground-truth reward.
• After training, we obtain a reward model that can predict the reward of its generated response.
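A common choice for this supervision is the pairwise ranking loss used in InstructGPT-style training: given a prompt x, a human-preferred response y_w, and a less-preferred response y_l, minimize

L_RM(θ) = − E [ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ]

where σ is the sigmoid; the loss pushes the predicted reward of the higher-ranked response above that of the lower-ranked one.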
Reinforcement learning
Reinforcement learning
RL makes the model learn to generate responses with high scores
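Concretely, the standard RLHF objective maximizes the learned reward while a KL penalty keeps the policy π_φ close to the SFT model:

max_φ  E_{x, y ∼ π_φ} [ r_θ(x, y) ] − β · KL( π_φ(·|x) ‖ π_SFT(·|x) )

The KL term prevents the policy from drifting into degenerate outputs that merely fool the reward model; in practice this objective is optimized with PPO.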
ChatGPT training pipeline
Source: Andrej Karpathy, State of GPT
Assistant models in the wild
A short summary
• We discussed the pipeline used to train ChatGPT
• SFT, RM, and RL are critical to transform GPT into ChatGPT
• SFT, RM, and RL are also critical to transform GPT into your own personalized assistant
PART 03
Use LLMs Effectively as Your Personal Copilot
Understand how humans and LLMs work differently
• Humans can plan and reflect
• Humans can use tools
• Humans typically think more
Understand how humans and LLMs work differently
• An LLM strips away all of this human behavior
Use prompts to help the LLM work like a human
• Chain of thought: break up tasks into multiple steps/stages
(will discuss it in later lectures)
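A minimal illustration, using the classic example from the chain-of-thought literature:

Standard prompt: "Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?"
CoT prompt: the same question, followed by "Let's think step by step."
Typical CoT-style answer: "Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11."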
Tree of thought
• Tree of thoughts: expand thoughts, evaluate them, and then go deeper
(will discuss it in later lectures)
• How to find simple and effective prompts is still a hot research topic
Prompt ensemble
Ask for reflection
Automatic prompt engineering (APE)
• Learn a good prompt automatically
[Large language models are human-level prompt engineers, 2023]
RAG empowered LLM
Retrieval-augmented generation (RAG) helps the LLM generate more precise, up-to-date, and personalized content.
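A hypothetical sketch of the RAG loop (index, llm, and their methods are illustrative stand-ins, not a real library API):

def rag_answer(question, index, llm, k=3):
    docs = index.search(question, top_k=k)       # 1. retrieve relevant documents
    context = "\n\n".join(d.text for d in docs)  # 2. augment the prompt with them
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm.generate(prompt)                  # 3. generate a grounded answer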
RAG empowered LLM
(figure: RAG-powered Bing Copilot vs. ChatGPT 3.5)
Tool use
Offload tasks that LLMs are not good at.
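One hypothetical pattern: the model emits a tool call for tasks it handles poorly (e.g., exact arithmetic), and the surrounding system executes it with real code:

import re

def run_with_tools(llm_output):
    # If the model emits e.g. CALC(1234 * 5678), evaluate it with the interpreter
    # instead of trusting the model's own arithmetic
    m = re.search(r"CALC\(([0-9+\-*/(). ]+)\)", llm_output)
    if m:
        return str(eval(m.group(1)))
    return llm_output

print(run_with_tools("The product is CALC(1234 * 5678)"))  # prints 7006652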
Finetuning
SFT and RLHF both finetune the pretrained base model
LoRA: Low-rank adaptation
Finetuning injects an additive weight update into the base model:
W' = W + ΔW   (fine-tuned weight = base model weight + additional weight)
LoRA constrains the update to a low-rank product:
W' = W + BA, where B ∈ R^(d×r), A ∈ R^(r×k), r ≪ min(d, k)   (fine-tuned weight = base model weight + low-rank weight)
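A minimal PyTorch sketch of a LoRA-augmented linear layer (an illustrative reimplementation following the paper's recipe, not the official code):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # freeze the pretrained weight W
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: small Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # B: zero init, so BA = 0 at start
        self.scale = alpha / r                              # scaling factor from the paper

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)  # Wx + (BA)x

Only A and B receive gradients, a tiny fraction of the full parameter count, while the frozen base weight stays shared with the pretrained model.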
LoRA: Low-rank adaptation
Light but powerful
LoRA: Low-rank adaptation
References
E. J. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models, https://arxiv.org/abs/2106.09685
Q. Zhang et al., AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning, https://arxiv.org/abs/2303.10512
How to use LLMs effectively?
Recommendations from OpenAI
Use cases
Course plan
• 1. Preliminary
§ Linear algebra; optimization
§ Machine learning; deep neural network
§ Word embedding; recurrent neural network; Seq2Seq
§ Attention; Transformer
§ GPT
Course plan
• 2. LLM pretraining
§ SGD
§ Momentum SGD; Adaptive SGD; Adam
§ Large-batch training; mixed-precision training
§ Data parallelism; model parallelism; tensor parallelism
Course plan
• 3. Finetuning
§ Supervised finetuning
§ RLHF
§ Parameter efficient finetuning (PEFT), e.g., LoRA
Course plan
• 4. Prompt engineering
§ Chain of thought; tree of thought
§ Principles to generate high-quality prompts
§ Automatic prompt engineering
Course plan
• 5. Applications
§ LLM agent
§ LLM in decision intelligence
Grading policy
• Homework (~30%)
• Mid-term (~30%)
• Final project and presentation (~40%)
Thank you!
Kun Yuan homepage: https://kunyuan827.github.io/