0% found this document useful (0 votes)
136 views102 pages

(Slide v3) Tutorial LLM Reasoning

Uploaded by

Vu Huy Nguyen
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
136 views102 pages

(Slide v3) Tutorial LLM Reasoning

Uploaded by

Vu Huy Nguyen
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 102

LLMs with Reasoning

LLMs Series

Dinh-Thang Duong – TA
Truong-Binh Duong – STA

Year 2025 Study Materials Demo


Objectives
❖ Content

Cha của Reggie đã cho anh ấy 48 đô la. Training Prompt


Reggie đã mua 5 cuốn sách, mỗi cuốn In this study session, we will discuss about:
có giá x. Reggie còn lại 38 tiền. Giá trị Modify Prompt System
của biến x chưa biết là bao nhiêu?
You are given a problem. Think about
the problem and provide your thought
1. Introduction to LLM with Reasoning.
process. Place it between <thinking>
Thinking ... and </thinking>. Then, provide your
final answer between <answer> and
2. Introduction to prompting techniques to
Step 1 </answer>.
make LLMs reasoning.
Step 2 Question
Thought
3. Introduction to LLM Reasoning.

Cha của Reggie đã cho anh ấy 48 đô la.


Reggie đã chi 48 − 38 = 10 đô Reggie đã mua 5 cuốn sách, mỗi cuốn
la cho 5 cuốn sách, nên lập Step n có giá x. Reggie còn lại 38 tiền. Giá trị
phương trình 5x = 10. Giải ra của biến x chưa biết là bao nhiêu? 4. How to fine-tuning LLMs for Reasoning.
được x = 2. Vậy mỗi cuốn
sách giá 2 đô la.

Output Training
Answer 5. How to implement an LLM Reasoning
Answer 2
2 application for Math Solving.

2
Outline
Ø Introduction
Ø Reasoning through Prompting
Ø LLM Reasoning
Ø Math Solving with LLM Reasoning
Ø Question

3
Introduction

4
Introduction
v Getting Started

Textual Description (Prompt):


Write a python function that receive an
image and plot it using matplotlib
library.

5
Introduction
v What are Large Language Models (LLMs)?

ChatGPT App:

1. User inputs a prompt (greetings,


task…).
2. ChatGPT (Chatbot) outputs a
appropriate response.

6
Introduction
v What are Large Language Models (LLMs)?

LLMs (Large Language Models): AI models (language models) that were trained on a very large corpus of text.
This made them capable of performing various NLP tasks with high precision.
7
https://www.reddit.com/r/AILinksandTools/comments/12c4jmk/a_survey_of_all_llms_so_far_2018_to_2022_a/
Introduction
v What are Large Language Models (LLMs)?

LLMs are often pretrained on a


vast majority of data and
designed to be adaptable to a
wide variety of tasks
(Foundation models).

8
https://blogs.nvidia.com/blog/what-are-foundation-models/
Introduction
v LLMs I/O
Output Text
Input Text

Write a python function that


receive an image and plot it LLMs
using matplotlib library.

Given a “prompt”, LLMs can generate


an appropriate response.

9
Introduction
v Generative AI Prompting

Prompting: Refers to a process of providing


an input, usually in the form of text data, to a
generative AI model to generate a specific
output.

10
https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fpractical-prompt-engineering-74e96130abc4
Introduction
v Prompting in LLMs

With prompting, we can make LLMs do any task with just natural language (zero-shot capability).

11
Introduction
v Getting Started

Example: AI Chatbot using Large Language Models (LLMs).


12
Introduction
v LLMs problem

Question: How can we improve LLMs on a specific task?

13
Introduction
v How to improve LLMs on specific tasks?

In-context learning

How to improve LLMs? Fine-tuning

Augmenting

14
Reasoning through Prompting

15
Reasoning through Prompting
v Introduction

In-context learning (ICL) in LLMs is a technique


where task demonstrations are integrated into the
prompt in a natural language format. This approach
allows pre-trained LLMs to address new tasks
without fine-tuning the model.

16
https://towardsdatascience.com/in-context-learning-approaches-in-large-language-models-9c0c53b116a1
Reasoning through Prompting
v Example
v Make LLMs adapt to a task using
instruction and examples.
Input: 2025-04-20
Output: !04!20!2025
Input: 2024-05-29
In-context examples
Output: !05!29!2024
Input: 2018-07-03
Output: !07!03!2018

Input: 2025-04-23
Test case
Expected Output: !04!23!2025
LLM Response
17
Reasoning through Prompting
v In-context learning type

Zero/One/Few-shot
learning

In-context learning Chain-of-Thought

18
Reasoning through Prompting
v Zero-shot learning

Zero-shot learning: Prompting LLMs to do a task without


any examples.

19
https://www.hopsworks.ai/dictionary/in-context-learning-icl
Reasoning through Prompting
v One-shot learning

One-shot learning: Prompting LLMs to do a task with an


example.

20
https://www.hopsworks.ai/dictionary/in-context-learning-icl
Reasoning through Prompting
v Few-shot learning

Few-shot learning: Prompting LLMs to do a task with


more than one example.

21
https://www.hopsworks.ai/dictionary/in-context-learning-icl
Reasoning through Prompting
v Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting enables complex


reasoning capabilities through intermediate reasoning
steps. You can combine it with few-shot prompting to get
better results on more complex tasks that require reasoning
before responding.

22
Reasoning through Prompting
v Chain-of-Thought

Standard Prompting Chain-of-Thought Prompting


Q: An có 7 quyển sách. Bạn ấy mua thêm 4 bộ sách,
Q: An có 7 quyển sách. Bạn ấy mua thêm 4 bộ sách,
mỗi bộ có 2 quyển. Hỏi An có tất cả bao nhiêu quyển
mỗi bộ có 2 quyển. Hỏi An có tất cả bao nhiêu quyển
sách?
sách?
A: An ban đầu có 7 quyển. 4 bộ × 2 quyển = 8 quyển.
A: Đáp án là 15.
Tổng cộng là 7 + 8 = 15 quyển. Đáp án là 15.
Q: Một cửa hàng có 15 chiếc bánh. Bán đi 10 chiếc và
Q: Một cửa hàng có 15 chiếc bánh. Bán đi 10 chiếc và
nhập thêm 5 chiếc mới. Hỏi cửa hàng còn bao nhiêu
nhập thêm 5 chiếc mới. Hỏi cửa hàng còn bao nhiêu
chiếc bánh?
chiếc bánh?

Output Output

A: Cửa hàng có 15 chiếc bánh. Bán 10 chiếc còn 5 chiếc.


A: Đáp án là 20.❌
Nhập thêm 5 chiếc nữa, 5 + 5 = 10. Đáp án là 10.✅

23
Reasoning through Prompting
v Chain-of-Thought
Input

What is the smallest positive perfect cube that can


be written as the sum of three consecutive integers?

Answer: 1008 ❌

Output
Llama-3.2-3B-Instruct

Standard Prompting

24
Reasoning through Prompting
v Chain-of-Thought

25
Reasoning through Prompting
v Chain-of-Thought
Think step by step to solve this question and show Thought
Input
your intermediate reasoning. What is the smallest
positive perfect cube that can be written as the sum
of three consecutive integers?

Let’s consider three consecutive integers: n, n+1, n+2. Their


sum is: n + (n+1) + (n+2) = 3n + 3. …
We want 3n + 3 to be a perfect cube. Try small perfect cubes:
1, 8, 27, 64, … Check which can be written as 3n + 3.

Start with 8: 8 = 3(1) + 5 → not valid.


Next, 27: 27 = 3(3) + 3 → valid! Output
So, 27 is the smallest perfect cube of 3n + 3.
Answer: 27. ✅ Chain-of-Thought Prompting
Llama-3.2-3B-Instruct
26
Reasoning through Prompting
v Chain-of-Thought

27
Reasoning through Prompting
v Self-Consistency
Thought
Input Input Input


… … …


Output 1 Output 2 … Output n

Majority Vote
Output Output Output

Standard Prompting Chain-of-Thought Prompting Self-Consistency with CoT


28
Reasoning through Prompting
v Self-Consistency
Input
What is the smallest positive perfect cube that can
be written as the sum of three consecutive integers? …
… … …
Let’s consider … => … =>
Answer: 27

We want to… => … => Output 1 Output 2 Output n


Answer: 9
Majority Vote
To solve this… => … =>
Output
Answer: 27
Self-Consistency with CoT
Llama-3.2-3B-Instruct
Majority Vote: 27 ✅
29
Reasoning through Prompting
v Tree-of-Thought

Input Input Input


… … …

Output 1 Output 2 Output n



Majority Vote

Output Output Output

CoT Prompting Self-Consistency with CoT Tree-of-Thought

30
Reasoning through Prompting
v Tree-of-Thought
Root (Input)
Input

Expansion

Evaluation

Selection …

Output
Termination
Tree-of-Thought

31
Reasoning through Prompting
v Tree-of-Thought
What is the smallest positive perfect cube that can be written as the sum
of three consecutive integers? ó Find 𝑛! = 3 𝑘 + 1 , 𝑘 ∈ 𝑍 "

Expansion 𝑛=1 𝑛=2 𝑛=3

−2 5
Evaluation ⇒𝑘= ⇒𝑘= ⇒𝑘=8
& Selection 3 3 ✅
❌ ❌

Answer: 𝑛! = 3! = 27

32
LLM Reasoning

33
LLM Reasoning
v DeepSeek

V3 & R1

https://chat.deepseek.com/
34
LLM Reasoning
v DeepSeek-R1-Zero

Without any Supervised Fine-tuning (SFT) data

DeepSeek-V3-Base DeepSeek-R1-Zero
Pretrained MoE 671B 671B

Reinforcement Learning

Group Relative Policy Optimization (GRPO)


with rule-based rewards

35
LLM Reasoning
v DeepSeek-R1-Zero

Prompt Template

A conversation between User and Assistant. The user asks a question, and the
Assistant solves it. The assistant first thinks about the reasoning process in the
mind and then provides the user with the answer. The reasoning process and
answer are enclosed within <think> </think> and <answer> </answer> tags,
respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>.

36
LLM Reasoning
v DeepSeek-R1-Zero

Group Relative Policy Optimization (GRPO)


with rule-based rewards

Format Reward
Accuracy Reward
<think> and </think>
Predict == Ground Truth
<answer> and </answer>

37
LLM Reasoning
v DeepSeek-R1-Zero

https://arxiv.org/pdf/2501.12948 38
LLM Reasoning
v DeepSeek-R1-Zero

Highlights Limitations
• Learns reflection, self-checking, long CoT • Low readability, hard-to-read outputs
• Shows self-evolution and "aha moment" • Language mixing (e.g., English + Chinese)
• First to prove LLMs can learn reasoning via RL only • Not ready for practical use without refinement

https://arxiv.org/pdf/2501.12948 39
LLM Reasoning
v DeepSeek-R1 Pipeline

0086

https://medium.com/@lmpo/deepseek-r1-affordable-efficient-and-state-of-the-art-ai-reasoning-f293b0bd8d65 40
LLM Reasoning
v DeepSeek Non-Reasoning vs DeepSeek Reasoning

• Evaluation: 1000 samples from Vietnamese MetaMathQA


• Example: Markus gấp đôi tuổi con trai ông và con trai
Markus gấp đôi tuổi cháu trai Markus. Nếu tổng số tuổi
của Markus, con trai ông và cháu trai ông là 140 tuổi thì
cháu trai của Markus bao nhiêu tuổi?
• GT Answer: 20

DeepSeek-V2-Lite (16B) DeepSeek-R1-Distill-Qwen (7B)

• Average Time: 10.32s / sample • Average Time: 11.02s / sample


• Accuracy: 4.8% • Accuracy: 59.5%

41
LLM Reasoning
v DeepSeek Non-Reasoning vs DeepSeek Reasoning

• Evaluation: 1000 samples from Vietnamese MetaMathQA


• Example: Markus gấp đôi tuổi con trai ông và con trai Markus gấp
đôi tuổi cháu trai Markus. Nếu tổng số tuổi của Markus, con trai ông
và cháu trai ông là 140 tuổi thì cháu trai của Markus bao nhiêu tuổi?
• GT Answer: 20

DeepSeek-R1-Distill-Qwen (7B)

• Thought: Ta gọi tuổi của Markus là M, con trai ông là B và cháu trai ông là DeepSeek-V2-Lite (16B)
C. Theo đề bài, ta có các quan hệ: M = 2B, B = 2C, và M + B + C = 140.
Thay thế các biểu thức theo C: M = 4C, B = 2C, ta được phương trình 4C + Answer: 30 ❌
2C + C = 140, suy ra 7C = 140 nên C = 20. Vậy cháu trai Markus 20 tuổi.
• Answer: 20 ✅

42
LLM Reasoning
v What is RL?

Reinforcement Learning (RL): A branch of machine


learning where an agent learns to make decisions by
interacting with an environment. Through this interaction,
the agent receives rewards or penalties based on its
actions, guiding it to learn an optimal policy for maximizing
cumulative rewards over time.

43
LLM Reasoning
v What is RL?

Type of learning in Machine Learning


44
LLM Reasoning
v RL idea
Points: 0 Points: 1
+1

Game objective: Get to the big cheese position with highest points.
45
LLM Reasoning
v RL idea

Points: 1 Points: -9
- 10

46
LLM Reasoning
v RL idea
Create an agent that could interact with the environment, learn to
Points: 0 reach the goal and obtain maximum rewards.

Avoid
moving
the right

Reset game and play again with new knowledge.

47
LLM Reasoning
v Example: Supervised Learning Approach

Object Detection Stock Price Prediction Text Classification

Train a model on labeled dataset to be


able to map any X to y.

Supervised Learning Approach 48


LLM Reasoning
v Example: RL Approach

Object Detection Stock Price Prediction Text Classification

Forming the problem by defining Agent,


Environment, State, Action, Reward.

Train the model by let the agent interacts


with the environment.

Reinforcement Learning Approach 49


LLM Reasoning
v RL idea: RL Framework

Reinforcement Learning Problem Formation

50
LLM Reasoning
v Example: Super Mario Bros

51
LLM Reasoning
v Example: Super Mario Bros

Player Enemies

Start Goal
52
LLM Reasoning
v Example: Super Mario Bros

Player: Mario

Moves: Up, Down, Left,


Right

Game Goal:

1. Avoid enemies

2. Reach the flag


Super Mario Bros Game
53
LLM Reasoning
v Agent

Agent: The decision-making entity that learns to


perform actions based on the current state of the
environment.

Example: In Super Mario Bros, the agent is Mario.

54
LLM Reasoning
v Environment

Environment: The context or setting in which the agent


operates, providing feedback to the agent in the form of
rewards or penalties.

Example: In Super Mario Bros:

55
LLM Reasoning
v State
State: Represents the specific situation or configuration
the agent encounters in the environment.

Example: In Super Mario Bros, state can be described by


key game elements:

Current frame raw pixels Objects Coordinate

56
LLM Reasoning
v Action

Action: The set of possible moves or decisions the agent


can take in response to the current state.

Example: In Super Mario Bros, easy version’s legal


actions might includes: Up, Down, Left, Right.

57
LLM Reasoning
v Reward
Reward: A scalar value that quantifies the
desirability of an action given a particular state,
guiding the agent's learning process.

Example: In Super Mario Bros, we can define


reward function as following:
Progress Time Lives

+1 for moving to right -1 for each second taken -100 for losing life
Points or Coins Level Completion

+10 for collecting coins or defeating enemies +500 for reaching the flag
58
LLM Reasoning
v RL Process in Super Mario Bros Agent

State St Perform At
Reward Rt

Return St+1,Rt+1
Reinforcement Learning Framework

Environment 59
LLM Reasoning
v Example: Stock Price Prediction

60
LLM Reasoning
v Example: Stock Price Prediction

In supervised learning, we prepare a training


dataset with features and labels, then build a
model to learn based on it.

61
LLM Reasoning
v Example: Stock Price Prediction

In reinforcement learning, we define the RL


framework for stock price prediction problem, then
employ a learning algorithm to interact and learn
how to achieve best profit.

1. Action: Buy/Sell/Hold.
2. States: All stock prices.
3. Rewards: Profit/Loss.

62
LLM Reasoning
v But what is the training objective?
In supervised learning: In reinforcement learning:

Ø We attempt to minimize the loss between prediction and label. Ø We attempt to maximize the expected cumulative
Ø Minimize the loss function. reward.
Ø Find optimal policy 𝜋.

63
LLM Reasoning
v Policy
Given state S, our agent will
have many possible actions
A.
Points: 0 Ø In RL, we attempt to maximize the expected
cumulative reward.

Need a way so that at every state, the agent could be


able to choose action that leads to the highest
expected cumulative reward.

Possible actions at S0: Right, Down.


Policy 64
LLM Reasoning
v Policy

Points: 0 Points: 0

+1 -10

0 0 0

+1 0 +10

65
LLM Reasoning
v Policy
Points: 0 Points: 0

+1 -10 +11 -9

0 0 0 +11 +11 +11

+1 0 +10 +22 +11 +21

State-value function: expected cumulative return the agent can get if it starts and that state, and act according to the policy

𝑉 0 𝑠 = 𝔼0 𝑅123 + 𝛾𝑅124 + 𝛾 4𝑅125 + ⋯ |𝑆1 = 𝑠


66
LLM Reasoning
v Policy

State S Policy 𝜋(S) Action A

Agent Policy 𝝅: the agent’s behavior, define how agent chooses action
Policy in response to the current state.
State Action
Deterministic Policy Stochastic Policy

Learning
𝑎 = 𝜋(𝑠) 𝜋 𝑎 𝑠 = 𝑃[𝐴|𝑠]
Algorithm

Reward

Environment Optimal policy 𝝅∗ leads to maximize the expected cumulative


67
reward.
LLM Reasoning
v How to find optimal policy?

Two main types of RL methods:


1. Value-based.
2. Policy-based.

68
LLM Reasoning
v Value-based methods

-3 -10

Value-based methods: train the agent to learn


-3 -2 -1 which state is more valuable and take the action
that leads to it.

-2 -1

69
LLM Reasoning
v Policy-based methods

Policy-based methods: train the agent to learn


which action to take, given a state.

70
LLM Reasoning
v RL Algorithms Taxonomy

71
LLM Reasoning
v Introduction Training Prompt
Cha của Reggie đã cho anh ấy 48 đô la.
Reggie đã mua 5 cuốn sách, mỗi cuốn có System
Modify Prompt
giá x. Reggie còn lại 38 tiền. Giá trị của You are given a problem. Think about
biến x chưa biết là bao nhiêu? the problem and provide your thought
process. Place it between <thinking>
and </thinking>. Then, provide your
Thinking ... final answer between <answer> and
</answer>.
Step 1

Step 2 Question


Thought Cha của Reggie đã cho anh ấy 48 đô la.
Reggie đã chi 48 − 38 = 10 đô la Reggie đã mua 5 cuốn sách, mỗi cuốn có
cho 5 cuốn sách, nên lập phương Step n giá x. Reggie còn lại 38 tiền. Giá trị của
trình 5x = 10. Giải ra được x = 2. biến x chưa biết là bao nhiêu?
Vậy mỗi cuốn sách giá 2 đô la.

Output Training Answer


Answer
2
2
72
LLM Reasoning
v Pipeline 1. Setup 3. Prepare 4. Configure
Environment Dataset LoRA

INSTRUCTION
Given a problem, explain
2. Load your reasoning within
Base Model <thinking></thinking>
tags, and provide the final
answer within <answer>
</answer> tags.

Math Problem & Answer

5. Training 6. Save 7. Run


LLM Checkpoints Evaluate

73
LLM Reasoning
v Training Math Reasoning

74
LLM Reasoning
v Step 1: Install and import necessary libraries

Unsloth is an open-source Python library that hand- vLLM is a high-throughput, memory-efficient LLM
writes GPU kernels and patches core ML frameworks inference and serving engine from UC Berkeley, leveraging
to fine-tune large language models up to 2× faster PagedAttention, continuous batching, speculative decoding,
while cutting GPU memory use by 70–80%. and multi-precision quantization support.

75
LLM Reasoning
v Step 1: Install and import necessary libraries

76
LLM Reasoning
v Step 2: Load base model

77
https://llama-2.ai/llama-2-model-details/
LLM Reasoning
v Step 3: Load & Preprocess Dataset

78
LLM Reasoning
v Step 3: Load & Preprocess Dataset
Answers Questions

Vietnamese-meta-math-MetaMathQA-40K-gg-translated Dataset
79
LLM Reasoning [
{
"role": "system",
"content": "You are a helpful assistant that summarizes content clearly."
v Chat-style model: Conversation },
{
[ "role": "user",
{ "content": "Please summarize the following:\n\nMachine learning is a field of AI
"role": "system", that allows computers to learn from data without being explicitly programmed."
"content": "You are a helpful assistant that summarizes },
content clearly." {
}, "role": "assistant",
{ "content": "Machine learning helps computers learn from data automatically,
"role": "user", without needing explicit instructions."
"content": "Please summarize the },
following:\n\nMachine learning is a field of AI that {
allows computers to learn from data without being "role": "user",
explicitly programmed." "content": "Can you also summarize this?\n\nDeep learning is a subset of machine
}, learning that uses neural networks with many layers."
{ },
"role": "assistant", {
"content": "Machine learning helps computers learn "role": "assistant",
from data automatically, without needing explicit "content": "Deep learning is a type of machine learning that uses multi-layered
instructions." neural networks to learn complex patterns from data."
} Single-turn } Multi-turn
] ] 80
LLM Reasoning
v Llama 3.2 Prompt Template
Supported Roles: There are 4 different roles that are supported by Llama text models: system, assistant, user, ipython.

[system, assistant, user, ipython]

Supported Role Description


Sets the context in which to interact with the AI model. It typically includes rules, guidelines, or
system
necessary information that help the model respond effectively.
Represents the human interacting with the model. It includes the inputs, commands, and questions to
user
the model.
A new role introduced in Llama 3.1. Semantically, this role means "tool". This role is used to mark
ipython
messages with the output of a tool call when sent back to the model from the executor.
Represents the response generated by the AI model based on the context provided in
assistant
the system, ipython and user prompts.

81
LLM Reasoning
v Llama 3.2 Prompt Template
Special Tokens Description
<|begin_of_text|> Specifies the start of the prompt.
<|end_of_text|> Model will cease to generate more tokens. This token is generated only by the base models.
<|finetune_right_pad_id|> This token is used for padding text sequences to the same length in a batch.
<|start_header_id|> These tokens enclose the role for a particular message. The possible roles are: [system, user, assistant,
and ipython]
<|end_header_id|>
<|eom_id|> End of message. A message represents a possible stopping point for execution where the model can
inform the executor that a tool call needs to be made. This is used for multi-step interactions between
the model and any available tools. This token is emitted by the model when the Environment:
ipython instruction is used in the system prompt, or if the model calls for a built-in tool.
<|eot_id|> End of turn. Represents when the model has determined that it has finished interacting with the user
message that initiated its response. This is used in two scenarios:
• at the end of a direct interaction between the model and the user
• at the end of multiple interactions between the model and any available tools
This token signals to the executor that the model has finished generating a response.
<|python_tag|> Special tag used in the model’s response to signify a tool call.
82
LLM Reasoning
v Llama 3.2 Prompt Template
Instruct Model Prompt: The format for a regular multi-turn conversation between a user and the model of Llama 3.2.

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023 • Each message role clearly marked with
Today Date: 23 July 2024
You are a helpful assistant.<|eot_id|> header tokens.
• <|eot_id|> separates each interaction turn.
<|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|> • System content should define environment,
cut-off date, tone, and rules.
<|start_header_id|>assistant<|end_header_id|>
The capital of France is Paris.

83
LLM Reasoning
v Step 3: Load & Preprocess Dataset

Prompt

84
LLM Reasoning
v Step 3: Load & Preprocess Dataset

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023


Today Date: 28 Apr 2025

You are given a problem.


Think about the problem and provide your thought process.
Place it between <thinking> and </thinking>.
Then, provide your final answer between <answer></answer><|eot_id|>
<|start_header_id|>user<|end_header_id|>

Cha của Reggie đã cho anh ấy 48 đô la. Reggie đã mua 5 cuốn sách, mỗi cuốn có giá x.
Reggie còn lại 38 tiền. Giá trị của biến x chưa biết là bao nhiêu?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

A sample input formatted for the LLaMA chat template.

85
LLM Reasoning
v Group Relative Policy Optimization (GRPO)

Calculate
Advantages

Update Sampling

Question: Nếu Micah uống 1,5 lít nước vào


buổi sáng và uống gấp 3 lần số đó vào buổi
chiều thì tổng lượng nước cô ấy uống từ
sáng đến chiều là bao nhiêu?
Answer: 6
86
LLM Reasoning
v Sampling

Sampling
6
outputs

Question: Nếu Micah uống 1,5 lít nước vào


buổi sáng và uống gấp 3 lần số đó vào buổi
chiều thì tổng lượng nước cô ấy uống từ
sáng đến chiều là bao nhiêu?

87
LLM Reasoning
v Calculate Advantage

5 0 −0.707

6 1 +1.414

4 0 −0.707
Reward 𝑟" Advantage
Question: Nếu Micah uống 1,5 lít nước vào (Accuracy) 𝑟" − 𝜇
-
𝐴" =
buổi sáng và uống gấp 3 lần số đó vào buổi 𝜎
chiều thì tổng lượng nước cô ấy uống từ
sáng đến chiều là bao nhiêu? 𝜇 = 0.333
𝜎 = 0.471

88
LLM Reasoning
v Update

5 0 −0.707

6 1 +1.414

4 0 −0.707
Reward 𝑟" Advantage
(Accuracy) 𝑟" − 𝜇
-
𝐴" =
𝜎

89
LLM Reasoning
v Step 4: Define Reward functions

90
LLM Reasoning
v Step 4: Define Reward functions

91
LLM Reasoning
v Step 5: Training

92
LLM Reasoning
v Step 6: Save & Inference model

93
LLM Reasoning
v Step 6: Save & Inference model

94
LLM Reasoning
v Step 6: Save & Inference model

• Evaluation: 1000 samples from Vietnamese MetaMathQA


• Example: Markus gấp đôi tuổi con trai ông và con trai Markus gấp
đôi tuổi cháu trai Markus. Nếu tổng số tuổi của Markus, con trai ông
và cháu trai ông là 140 tuổi thì cháu trai của Markus bao nhiêu tuổi?
• GT Answer: 20

Llama-3.2-3B-Instruct
Original

• Average Time: 0.71s / sample


• Accuracy: 9%
• Predict: 10 ❌

95
LLM Reasoning
v Step 6: Save & Inference model

• Evaluation: 1000 samples from Vietnamese MetaMathQA


You are given a problem. • Example: Markus gấp đôi tuổi con trai ông và con trai
Think about the problem and provide your thought process. Markus gấp đôi tuổi cháu trai Markus. Nếu tổng số tuổi của
Place it between <thinking> and </thinking>. Markus, con trai ông và cháu trai ông là 140 tuổi thì cháu trai
Then, provide your final answer between <answer></answer> của Markus bao nhiêu tuổi?
• GT Answer: 20

Llama-3.2-3B-Instruct with CoT <thinking>Markus gấp đôi tuổi con trai mình, còn con trai Markus
lại gấp đôi tuổi cháu trai của ông. Gọi tuổi cháu trai là x, thì con trai
• Average Time: 3.2s / sample Markus là 2x và Markus là 4x. Tổng ba người là 140 tuổi nên ta có:
• Accuracy: 11.9% 4x + 2x + x = 140 → 7x = 140 → x = 20.
• Format score: 0.4155 Vậy cháu trai Markus 20 tuổi.</thinking>
• Predict: 20 ✅ <answer>20</answer> ✅

96
LLM Reasoning
v Step 6: Save & Inference model

• Evaluation: 1000 samples from Vietnamese MetaMathQA


You are given a problem. • Example: Markus gấp đôi tuổi con trai ông và con trai
Think about the problem and provide your thought process. Markus gấp đôi tuổi cháu trai Markus. Nếu tổng số tuổi của
Place it between <thinking> and </thinking>. Markus, con trai ông và cháu trai ông là 140 tuổi thì cháu trai
Then, provide your final answer between <answer></answer> của Markus bao nhiêu tuổi?
• GT Answer: 20

Llama-3.2-3B-Instruct
after training GRPO <thinking>Để giải bài toán, ta đặt tuổi cháu trai Markus là x. Khi đó,
tuổi Markus là 2x, còn con trai ông là 4x. Tổng tuổi ba người là x +
• Average Time: 3.6s / sample 2x + 4x = 7x. Biết tổng là 140, ta có 7x = 140 ⇒ x = 20. Vậy cháu
• Accuracy: 31.1% trai Markus 20 tuổi.</thinking>
• Format score: 0.7875 <answer>20</answer> ✅
• Predict: 20 ✅

97
QUIZ

98
Summarization and Q&A

99
Summarization and Q&A

Cha của Reggie đã cho anh ấy 48 đô la.


Reggie đã mua 5 cuốn sách, mỗi cuốn
Training Prompt In this study session, we have discussed
có giá x. Reggie còn lại 38 tiền. Giá trị Modify Prompt System
của biến x chưa biết là bao nhiêu?
You are given a problem. Think about
about:
the problem and provide your thought
process. Place it between <thinking>
and </thinking>. Then, provide your
1. Introduction to LLM with Reasoning.
Thinking ...
final answer between <answer> and
Step 1 </answer>. 2. Introduction to prompting techniques to
Step 2 Question
Thought
make LLMs reasoning.

Cha của Reggie đã cho anh ấy 48 đô la.


Reggie đã chi 48 − 38 = 10 đô Reggie đã mua 5 cuốn sách, mỗi cuốn
la cho 5 cuốn sách, nên lập Step n có giá x. Reggie còn lại 38 tiền. Giá trị 3. Introduction to LLM Reasoning.
phương trình 5x = 10. Giải ra của biến x chưa biết là bao nhiêu?
được x = 2. Vậy mỗi cuốn
sách giá 2 đô la.
Answer
4. Fine-tuning LLMs for Reasoning.
Output Training
Answer 2 5. Implementing an LLM Reasoning
2
application for Math Solving.

100
Summarization and Q&A

?
101
102

You might also like