Penghui Qi
206 posts
Penghui Qi
@QPHutu
Researcher @SeaAIL, PhD student @NUSingapore. Working on RL, LLM Reasoning, and MLSys.
Joined August 2022
261 Following · 1,421 Followers
  • Pinned
    Penghui Qi
    @QPHutu
    Feb 5
    This time we should say goodbye to PPO/GRPO for real 👋 PPO is a great algorithm in classical RL settings. However, it is fundamentally flawed in the LLM regime due to the large, long-tailed vocabulary.💔 Check out our paper for more details👇
    50K
  • user avatar
    Penghui Qi
    @QPHutu
    Oct 31, 2025
    🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precis…
    220K
  • user avatar
    Penghui Qi
    @QPHutu
    Nov 1, 2025
    Thanks for this fix. Actually, it is not quite this easy: a GradScaler should be introduced to avoid gradient underflow; otherwise the performance can be even worse than BF16. See: docs.pytorch.org/docs/stable/am… VeRL Example: github.com/sail-sg/Precis…
    36K
  • user avatar
    Penghui Qi
    @QPHutu
    Oct 31, 2025
    ⛈️ VeRL does not natively support FP16 training. A naive implementation will suffer from gradient underflow. 💊 🚀We provide a minimal patch for VeRL to enable effective FP16 training, with about 10 lines of code change.👇 ⌨️
    Penghui Qi
    @QPHutu
    Oct 31, 2025
    🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precis…
    Precision-RL/verl_fp16.patch at main · sail-sg/Precision-RL
    From github.com
    18K
  • user avatar
    Penghui Qi
    @QPHutu
    Nov 13, 2025
    Finally! Although it's 2.4x slower right now (I believe many optimizations are coming), the results are really promising! It is a huge step towards truly on-policy RL! Amazing work!
    vLLM
    @vllm_project
    Nov 12, 2025
    🚀 No More Train–Inference Mismatch! We demonstrate bitwise consistent on-policy RL with TorchTitan (training) + vLLM (inference) — the first open-source run where training and inference numerics match exactly. It only takes 3 steps: 1️⃣ Make vLLM batch-invariant (same seq →
    15K
  • user avatar
    Penghui Qi
    @QPHutu
    Nov 2, 2025
    Indeed, many ppl never saw their BF16 training collapse, but the problem exists, as many reports show. We reproduce this instability by designing a sanity test (just like MNIST for CV) for better understanding. Large models+datasets are here👇 Give it a try, you may be surprised.
    Zichen Liu
    @zzlccc
    Nov 1, 2025
    Thanks for the thought! Some further thoughts (clarifications): 1. Reasonably designed algorithms (let’s also include precision in the design space) should not collapse on small data. It’s just like if my CNN cannot even overfit MNIST, how can I trust it will master 1000-class
    19K
  • user avatar
    Penghui Qi
    @QPHutu
    May 20, 2025
    👀Optimizing Anytime Reasoning via Budget Relative Policy Optimization👀 🚀Our BRPO leverages verifiable dense rewards, significantly outperforming GRPO in both final and anytime reasoning performance.🚀 📰Paper: arxiv.org/abs/2505.13438 🛠️Code: github.com/sail-sg/Anytim…
    33K
  • user avatar
    Penghui Qi
    @QPHutu
    Nov 1, 2025
    This is exactly what we wanted to share with our FP16 tech report! Thanks @Grad62304977 for the great explanation.
    Grad
    Prime Intellect
    @Grad62304977
    Nov 1, 2025
    Replying to @redtachyon
    Well sort of, with just GRPO and not actually taking care of the mismatch at the algorithm level, u will encounter instability with bf16 under normal training settings like here (and as many papers for actual models like Kimi linear have mentioned). Their point is that given
    15K
  • user avatar
    Penghui Qi
    @QPHutu
    Oct 31, 2025
    Huge thanks to @Grad62304977 for quickly testing out our findings on using FP16 for RL fine-tuning and confirming the results!🥇
    Grad
    Prime Intellect
    @Grad62304977
    Oct 31, 2025
    Replying to @Grad62304977
    6K
  • user avatar
    Penghui Qi
    @QPHutu
    Nov 14, 2025
    More amazing progress on truly on-policy RL!💯 I believe it is a headache for the community to find a reproducible setting where the mismatch consistently causes training collapse. If so, you may want to check out this sanity test. Link to the dataset👇 huggingface.co/datasets/sail/…
    LMSYS Org
    @lmsysorg
    Nov 14, 2025
    💥 We've achieved perfect training-inference alignment for SGLang & FSDP in slime! (Flash Attn 3, DeepGEMM, etc.) The result? A strict KL divergence of 0. But here's the twist: We spent a month trying to find a baseline that crashes from mismatch... and couldn't. 🤷‍♂️ We haven't
    12K
  • user avatar
    Penghui Qi
    @QPHutu
    Nov 3, 2025
    Many thanks for these exciting results. I’ve been waiting all weekend for someone to reproduce them, and I’m thrilled they’re here.
    Łukasz Borchmann
    @LukaszBorchmann
    Nov 3, 2025
    Replying to @redtachyon
    Well, not only A100. Here is the sanity check on H200 (GRPO, 32B dense model). The authors also mention that they did some larger-scale experiments on H100.
    8.5K
  • user avatar
    Penghui Qi
    @QPHutu
    Nov 2, 2025
    Thank you @karpathy for finding our paper interesting. This is very encouraging.
    Andrej Karpathy
    @karpathy
    Nov 1, 2025
    Replying to @MarFot78 and @zzlccc
    I think if you zoomed into the paper too you’d find it just as if not more interesting.
    7.3K
  • user avatar
    Penghui Qi
    @QPHutu
    Nov 2, 2025
    Replying to @RichardYRLi and @danielhanchen
    Hi @RichardYRLi, I tried disable_cascade_attn many times, including with the latest vLLM version, but unfortunately it made no difference in our experiments. So I guess it really depends on the setting.
    6K
  • user avatar
    Penghui Qi
    @QPHutu
    Oct 31, 2025
    Replying to @QPHutu
    4.2K
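
The Nov 1, 2025 post above says that naive FP16 training suffers from gradient underflow unless a GradScaler is introduced. Below is a minimal sketch of the underlying idea (loss scaling), using only Python's standard library. `to_fp16` and `fp16_grad` are hypothetical helpers written for this illustration; they are not part of PyTorch or VeRL, and the fixed scale of 2**16 only mirrors the default `init_scale` of PyTorch's GradScaler, not its dynamic adjustment.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (inf on overflow)."""
    try:
        # The "e" format code packs a float as binary16 (half precision).
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values too large for half precision.
        return float("inf") if x > 0 else float("-inf")


def fp16_grad(true_grad: float, scale: float = 1.0) -> float:
    """Hypothetical helper: store a gradient in FP16 after multiplying it by
    `scale`, then divide the scale back out in full precision."""
    stored = to_fp16(true_grad * scale)
    return stored / scale


# A gradient below FP16's smallest subnormal (2**-24, about 5.96e-8).
tiny = 1e-8

naive = fp16_grad(tiny)                    # no scaling: flushed to zero
scaled = fp16_grad(tiny, scale=2.0 ** 16)  # scaled into FP16's normal range

print(f"naive:  {naive}")
print(f"scaled: {scaled:.3e}")
```

Without scaling, the gradient is lost entirely; with scaling, it survives with only FP16 rounding error. In real training, PyTorch's GradScaler additionally skips optimizer steps whose gradients contain inf/NaN and adjusts the scale over time, which this sketch omits.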
