(Slide v3) Tutorial LLM Reasoning
(Slide v3) Tutorial LLM Reasoning
LLMs Series
Dinh-Thang Duong – TA
Truong-Binh Duong – STA
Output Training
Answer 5. How to implement an LLM Reasoning
Answer 2
2 application for Math Solving.
2
Outline
Ø Introduction
Ø Reasoning through Prompting
Ø LLM Reasoning
Ø Math Solving with LLM Reasoning
Ø Question
3
Introduction
4
Introduction
v Getting Started
5
Introduction
v What are Large Language Models (LLMs)?
ChatGPT App:
6
Introduction
v What are Large Language Models (LLMs)?
LLMs (Large Language Models): AI models (language models) that were trained on a very large corpus of text.
This made them capable of performing various NLP tasks with high precision.
7
https://www.reddit.com/r/AILinksandTools/comments/12c4jmk/a_survey_of_all_llms_so_far_2018_to_2022_a/
Introduction
v What are Large Language Models (LLMs)?
8
https://blogs.nvidia.com/blog/what-are-foundation-models/
Introduction
v LLMs I/O
Output Text
Input Text
9
Introduction
v Generative AI Prompting
10
https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fpractical-prompt-engineering-74e96130abc4
Introduction
v Prompting in LLMs
With prompting, we can make LLMs do any task with just natural language (zero-shot capability).
11
Introduction
v Getting Started
13
Introduction
v How to improve LLMs on specific tasks?
In-context learning
Augmenting
14
Reasoning through Prompting
15
Reasoning through Prompting
v Introduction
16
https://towardsdatascience.com/in-context-learning-approaches-in-large-language-models-9c0c53b116a1
Reasoning through Prompting
v Example
v Make LLMs adapt to a task using
instruction and examples.
Input: 2025-04-20
Output: !04!20!2025
Input: 2024-05-29
In-context examples
Output: !05!29!2024
Input: 2018-07-03
Output: !07!03!2018
Input: 2025-04-23
Test case
Expected Output: !04!23!2025
LLM Response
17
Reasoning through Prompting
v In-context learning type
Zero/One/Few-shot
learning
18
Reasoning through Prompting
v Zero-shot learning
19
https://www.hopsworks.ai/dictionary/in-context-learning-icl
Reasoning through Prompting
v One-shot learning
20
https://www.hopsworks.ai/dictionary/in-context-learning-icl
Reasoning through Prompting
v Few-shot learning
21
https://www.hopsworks.ai/dictionary/in-context-learning-icl
Reasoning through Prompting
v Chain-of-Thought Prompting
22
Reasoning through Prompting
v Chain-of-Thought
Output Output
23
Reasoning through Prompting
v Chain-of-Thought
Input
Answer: 1008 ❌
Output
Llama-3.2-3B-Instruct
Standard Prompting
24
Reasoning through Prompting
v Chain-of-Thought
25
Reasoning through Prompting
v Chain-of-Thought
Think step by step to solve this question and show Thought
Input
your intermediate reasoning. What is the smallest
positive perfect cube that can be written as the sum
of three consecutive integers?
27
Reasoning through Prompting
v Self-Consistency
Thought
Input Input Input
…
… … …
…
Output 1 Output 2 … Output n
Majority Vote
Output Output Output
…
… … …
30
Reasoning through Prompting
v Tree-of-Thought
Root (Input)
Input
Expansion
Evaluation
Selection …
Output
Termination
Tree-of-Thought
31
Reasoning through Prompting
v Tree-of-Thought
What is the smallest positive perfect cube that can be written as the sum
of three consecutive integers? ó Find 𝑛! = 3 𝑘 + 1 , 𝑘 ∈ 𝑍 "
−2 5
Evaluation ⇒𝑘= ⇒𝑘= ⇒𝑘=8
& Selection 3 3 ✅
❌ ❌
Answer: 𝑛! = 3! = 27
32
LLM Reasoning
33
LLM Reasoning
v DeepSeek
V3 & R1
https://chat.deepseek.com/
34
LLM Reasoning
v DeepSeek-R1-Zero
DeepSeek-V3-Base DeepSeek-R1-Zero
Pretrained MoE 671B 671B
Reinforcement Learning
35
LLM Reasoning
v DeepSeek-R1-Zero
Prompt Template
A conversation between User and Assistant. The user asks a question, and the
Assistant solves it. The assistant first thinks about the reasoning process in the
mind and then provides the user with the answer. The reasoning process and
answer are enclosed within <think> </think> and <answer> </answer> tags,
respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>.
36
LLM Reasoning
v DeepSeek-R1-Zero
Format Reward
Accuracy Reward
<think> and </think>
Predict == Ground Truth
<answer> and </answer>
37
LLM Reasoning
v DeepSeek-R1-Zero
https://arxiv.org/pdf/2501.12948 38
LLM Reasoning
v DeepSeek-R1-Zero
Highlights Limitations
• Learns reflection, self-checking, long CoT • Low readability, hard-to-read outputs
• Shows self-evolution and "aha moment" • Language mixing (e.g., English + Chinese)
• First to prove LLMs can learn reasoning via RL only • Not ready for practical use without refinement
https://arxiv.org/pdf/2501.12948 39
LLM Reasoning
v DeepSeek-R1 Pipeline
0086
https://medium.com/@lmpo/deepseek-r1-affordable-efficient-and-state-of-the-art-ai-reasoning-f293b0bd8d65 40
LLM Reasoning
v DeepSeek Non-Reasoning vs DeepSeek Reasoning
41
LLM Reasoning
v DeepSeek Non-Reasoning vs DeepSeek Reasoning
DeepSeek-R1-Distill-Qwen (7B)
• Thought: Ta gọi tuổi của Markus là M, con trai ông là B và cháu trai ông là DeepSeek-V2-Lite (16B)
C. Theo đề bài, ta có các quan hệ: M = 2B, B = 2C, và M + B + C = 140.
Thay thế các biểu thức theo C: M = 4C, B = 2C, ta được phương trình 4C + Answer: 30 ❌
2C + C = 140, suy ra 7C = 140 nên C = 20. Vậy cháu trai Markus 20 tuổi.
• Answer: 20 ✅
42
LLM Reasoning
v What is RL?
43
LLM Reasoning
v What is RL?
Game objective: Get to the big cheese position with highest points.
45
LLM Reasoning
v RL idea
Points: 1 Points: -9
- 10
46
LLM Reasoning
v RL idea
Create an agent that could interact with the environment, learn to
Points: 0 reach the goal and obtain maximum rewards.
Avoid
moving
the right
47
LLM Reasoning
v Example: Supervised Learning Approach
50
LLM Reasoning
v Example: Super Mario Bros
51
LLM Reasoning
v Example: Super Mario Bros
Player Enemies
Start Goal
52
LLM Reasoning
v Example: Super Mario Bros
Player: Mario
Game Goal:
1. Avoid enemies
54
LLM Reasoning
v Environment
55
LLM Reasoning
v State
State: Represents the specific situation or configuration
the agent encounters in the environment.
56
LLM Reasoning
v Action
57
LLM Reasoning
v Reward
Reward: A scalar value that quantifies the
desirability of an action given a particular state,
guiding the agent's learning process.
+1 for moving to right -1 for each second taken -100 for losing life
Points or Coins Level Completion
+10 for collecting coins or defeating enemies +500 for reaching the flag
58
LLM Reasoning
v RL Process in Super Mario Bros Agent
State St Perform At
Reward Rt
Return St+1,Rt+1
Reinforcement Learning Framework
Environment 59
LLM Reasoning
v Example: Stock Price Prediction
60
LLM Reasoning
v Example: Stock Price Prediction
61
LLM Reasoning
v Example: Stock Price Prediction
1. Action: Buy/Sell/Hold.
2. States: All stock prices.
3. Rewards: Profit/Loss.
62
LLM Reasoning
v But what is the training objective?
In supervised learning: In reinforcement learning:
Ø We attempt to minimize the loss between prediction and label. Ø We attempt to maximize the expected cumulative
Ø Minimize the loss function. reward.
Ø Find optimal policy 𝜋.
63
LLM Reasoning
v Policy
Given state S, our agent will
have many possible actions
A.
Points: 0 Ø In RL, we attempt to maximize the expected
cumulative reward.
Points: 0 Points: 0
+1 -10
0 0 0
+1 0 +10
65
LLM Reasoning
v Policy
Points: 0 Points: 0
+1 -10 +11 -9
State-value function: expected cumulative return the agent can get if it starts and that state, and act according to the policy
Agent Policy 𝝅: the agent’s behavior, define how agent chooses action
Policy in response to the current state.
State Action
Deterministic Policy Stochastic Policy
Learning
𝑎 = 𝜋(𝑠) 𝜋 𝑎 𝑠 = 𝑃[𝐴|𝑠]
Algorithm
Reward
68
LLM Reasoning
v Value-based methods
-3 -10
-2 -1
69
LLM Reasoning
v Policy-based methods
70
LLM Reasoning
v RL Algorithms Taxonomy
71
LLM Reasoning
v Introduction Training Prompt
Cha của Reggie đã cho anh ấy 48 đô la.
Reggie đã mua 5 cuốn sách, mỗi cuốn có System
Modify Prompt
giá x. Reggie còn lại 38 tiền. Giá trị của You are given a problem. Think about
biến x chưa biết là bao nhiêu? the problem and provide your thought
process. Place it between <thinking>
and </thinking>. Then, provide your
Thinking ... final answer between <answer> and
</answer>.
Step 1
Step 2 Question
…
Thought Cha của Reggie đã cho anh ấy 48 đô la.
Reggie đã chi 48 − 38 = 10 đô la Reggie đã mua 5 cuốn sách, mỗi cuốn có
cho 5 cuốn sách, nên lập phương Step n giá x. Reggie còn lại 38 tiền. Giá trị của
trình 5x = 10. Giải ra được x = 2. biến x chưa biết là bao nhiêu?
Vậy mỗi cuốn sách giá 2 đô la.
INSTRUCTION
Given a problem, explain
2. Load your reasoning within
Base Model <thinking></thinking>
tags, and provide the final
answer within <answer>
</answer> tags.
73
LLM Reasoning
v Training Math Reasoning
74
LLM Reasoning
v Step 1: Install and import necessary libraries
Unsloth is an open-source Python library that hand- vLLM is a high-throughput, memory-efficient LLM
writes GPU kernels and patches core ML frameworks inference and serving engine from UC Berkeley, leveraging
to fine-tune large language models up to 2× faster PagedAttention, continuous batching, speculative decoding,
while cutting GPU memory use by 70–80%. and multi-precision quantization support.
75
LLM Reasoning
v Step 1: Install and import necessary libraries
76
LLM Reasoning
v Step 2: Load base model
77
https://llama-2.ai/llama-2-model-details/
LLM Reasoning
v Step 3: Load & Preprocess Dataset
78
LLM Reasoning
v Step 3: Load & Preprocess Dataset
Answers Questions
Vietnamese-meta-math-MetaMathQA-40K-gg-translated Dataset
79
LLM Reasoning [
{
"role": "system",
"content": "You are a helpful assistant that summarizes content clearly."
v Chat-style model: Conversation },
{
[ "role": "user",
{ "content": "Please summarize the following:\n\nMachine learning is a field of AI
"role": "system", that allows computers to learn from data without being explicitly programmed."
"content": "You are a helpful assistant that summarizes },
content clearly." {
}, "role": "assistant",
{ "content": "Machine learning helps computers learn from data automatically,
"role": "user", without needing explicit instructions."
"content": "Please summarize the },
following:\n\nMachine learning is a field of AI that {
allows computers to learn from data without being "role": "user",
explicitly programmed." "content": "Can you also summarize this?\n\nDeep learning is a subset of machine
}, learning that uses neural networks with many layers."
{ },
"role": "assistant", {
"content": "Machine learning helps computers learn "role": "assistant",
from data automatically, without needing explicit "content": "Deep learning is a type of machine learning that uses multi-layered
instructions." neural networks to learn complex patterns from data."
} Single-turn } Multi-turn
] ] 80
LLM Reasoning
v Llama 3.2 Prompt Template
Supported Roles: There are 4 different roles that are supported by Llama text models: system, assistant, user, ipython.
81
LLM Reasoning
v Llama 3.2 Prompt Template
Special Tokens Description
<|begin_of_text|> Specifies the start of the prompt.
<|end_of_text|> Model will cease to generate more tokens. This token is generated only by the base models.
<|finetune_right_pad_id|> This token is used for padding text sequences to the same length in a batch.
<|start_header_id|> These tokens enclose the role for a particular message. The possible roles are: [system, user, assistant,
and ipython]
<|end_header_id|>
<|eom_id|> End of message. A message represents a possible stopping point for execution where the model can
inform the executor that a tool call needs to be made. This is used for multi-step interactions between
the model and any available tools. This token is emitted by the model when the Environment:
ipython instruction is used in the system prompt, or if the model calls for a built-in tool.
<|eot_id|> End of turn. Represents when the model has determined that it has finished interacting with the user
message that initiated its response. This is used in two scenarios:
• at the end of a direct interaction between the model and the user
• at the end of multiple interactions between the model and any available tools
This token signals to the executor that the model has finished generating a response.
<|python_tag|> Special tag used in the model’s response to signify a tool call.
82
LLM Reasoning
v Llama 3.2 Prompt Template
Instruct Model Prompt: The format for a regular multi-turn conversation between a user and the model of Llama 3.2.
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023 • Each message role clearly marked with
Today Date: 23 July 2024
You are a helpful assistant.<|eot_id|> header tokens.
• <|eot_id|> separates each interaction turn.
<|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|> • System content should define environment,
cut-off date, tone, and rules.
<|start_header_id|>assistant<|end_header_id|>
The capital of France is Paris.
83
LLM Reasoning
v Step 3: Load & Preprocess Dataset
Prompt
84
LLM Reasoning
v Step 3: Load & Preprocess Dataset
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cha của Reggie đã cho anh ấy 48 đô la. Reggie đã mua 5 cuốn sách, mỗi cuốn có giá x.
Reggie còn lại 38 tiền. Giá trị của biến x chưa biết là bao nhiêu?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
85
LLM Reasoning
v Group Relative Policy Optimization (GRPO)
Calculate
Advantages
Update Sampling
Sampling
6
outputs
87
LLM Reasoning
v Calculate Advantage
5 0 −0.707
6 1 +1.414
4 0 −0.707
Reward 𝑟" Advantage
Question: Nếu Micah uống 1,5 lít nước vào (Accuracy) 𝑟" − 𝜇
-
𝐴" =
buổi sáng và uống gấp 3 lần số đó vào buổi 𝜎
chiều thì tổng lượng nước cô ấy uống từ
sáng đến chiều là bao nhiêu? 𝜇 = 0.333
𝜎 = 0.471
88
LLM Reasoning
v Update
5 0 −0.707
6 1 +1.414
4 0 −0.707
Reward 𝑟" Advantage
(Accuracy) 𝑟" − 𝜇
-
𝐴" =
𝜎
89
LLM Reasoning
v Step 4: Define Reward functions
90
LLM Reasoning
v Step 4: Define Reward functions
91
LLM Reasoning
v Step 5: Training
92
LLM Reasoning
v Step 6: Save & Inference model
93
LLM Reasoning
v Step 6: Save & Inference model
94
LLM Reasoning
v Step 6: Save & Inference model
Llama-3.2-3B-Instruct
Original
95
LLM Reasoning
v Step 6: Save & Inference model
Llama-3.2-3B-Instruct with CoT <thinking>Markus gấp đôi tuổi con trai mình, còn con trai Markus
lại gấp đôi tuổi cháu trai của ông. Gọi tuổi cháu trai là x, thì con trai
• Average Time: 3.2s / sample Markus là 2x và Markus là 4x. Tổng ba người là 140 tuổi nên ta có:
• Accuracy: 11.9% 4x + 2x + x = 140 → 7x = 140 → x = 20.
• Format score: 0.4155 Vậy cháu trai Markus 20 tuổi.</thinking>
• Predict: 20 ✅ <answer>20</answer> ✅
96
LLM Reasoning
v Step 6: Save & Inference model
Llama-3.2-3B-Instruct
after training GRPO <thinking>Để giải bài toán, ta đặt tuổi cháu trai Markus là x. Khi đó,
tuổi Markus là 2x, còn con trai ông là 4x. Tổng tuổi ba người là x +
• Average Time: 3.6s / sample 2x + 4x = 7x. Biết tổng là 140, ta có 7x = 140 ⇒ x = 20. Vậy cháu
• Accuracy: 31.1% trai Markus 20 tuổi.</thinking>
• Format score: 0.7875 <answer>20</answer> ✅
• Predict: 20 ✅
97
QUIZ
98
Summarization and Q&A
99
Summarization and Q&A
100
Summarization and Q&A
?
101
102