Zichen Liu
588 posts
Zichen Liu
@zzlccc
Gemini RL @GoogleDeepMind
Singapore
lkevinzc.github.io
Joined October 2021
459 Following
6,057 Followers
  • Pinned
    Zichen Liu
    @zzlccc
    Mar 21, 2025
    🪂Understanding R1-Zero-Like Training: A Critical Perspective * DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning?? * The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO?? * Getting GRPO Done Right, we achieve a 7B AIME sota! 🧵 📜Full
    331K
  • Zichen Liu
    @zzlccc
    Nov 1, 2025
    Super excited that @karpathy noticed our work! Hopefully it helps the broader community realize that *precision* deserves a place in our design space.
    279K
  • Zichen Liu
    @zzlccc
    Oct 2, 2025
    much more convinced after getting my own results: LoRA with rank=1 learns (and generalizes) as well as full-tuning while saving 43% vRAM usage! allows me to RL bigger models with limited resources😆 script: github.com/sail-sg/oat/bl…
    Thinking Machines
    @thinkymachines
    Sep 29, 2025
    LoRA makes fine-tuning more accessible, but it's unclear how it compares to full fine-tuning. We find that the performance often matches closely---more often than you might expect. In our latest Connectionism post, we share our experimental results and recommendations for LoRA.
    205K
  • Zichen Liu
    @zzlccc
    Sep 22, 2025
    exactly. and we will never derive a term like 1/|o|. seeing so many papers still using the original GRPO is sad.
    Nan Jiang
    @nanjiang_cs
    Sep 21, 2025
    I was surprised by how many didn't know that (1) per token MLE is whole seq MLE, and (2) PG at token level is the same as PG at seq level (optimizing one big combinatorial action). story is different if you introduce fitted critic/Q-values or intermediate resets.
    62K
  • Zichen Liu
    @zzlccc
    Oct 25, 2025
    Nothing feels more exciting than writing a thesis proposal on RL for LLMs before 2025 ends!! Covering a subset of my first-author works done in the past 1.5 years (after switching from traditional RL to LLM RL…) Tentative title, of course
    61K
  • Zichen Liu
    @zzlccc
    Oct 31, 2025
    BF16 -> FP16 is such a simple (one configuration change in Oat) yet fundamental fix for inference-training mismatch. With FP16, the most basic importance sampling PG outperforms all algorithmic fixes in BF16. Let's rethink RL stability from the precision perspective.🔎
    Penghui Qi
    @QPHutu
    Oct 31, 2025
    🚀Excited to share our new work! 💊Problem: The BF16 precision causes a large training-inference mismatch, leading to unstable RL training. 💡Solution: Just switch to FP16. 🎯That's it. 📰Paper: arxiv.org/pdf/2510.26788 ⭐️Code: github.com/sail-sg/Precis…
    78K
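To make the precision point concrete, here is a minimal sketch of how coarsely each 16-bit format rounds a value. It relies only on the format definitions (FP16 keeps 10 explicit mantissa bits, BF16 keeps 7) and is not taken from the paper or from Oat; the helper names `to_fp16` and `to_bf16` are hypothetical, and the BF16 emulation does not handle NaN or infinity.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 is the upper 16 bits of a float32, so we round the float32
    bit pattern to 16 bits and zero the lower half.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# 1 + 2**-9 needs 9 fractional mantissa bits: FP16 keeps it, BF16 rounds to 1.0.
x = 1 + 2**-9
print(to_fp16(x) == x, to_bf16(x) == 1.0)

# A probability-like value: compare the absolute rounding errors.
p = 0.3
print(abs(to_fp16(p) - p), abs(to_bf16(p) - p))
```

The sketch only illustrates rounding granularity. How that granularity turns into a training-inference mismatch is the subject of the linked paper and is not reproduced here.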
  • Zichen Liu
    @zzlccc
    Feb 6, 2025
    🚨There May Not be Aha Moment in R1-Zero-like Training: oatllm.notion.site/oat-zero A common belief about the recent R1-Zero-like training is that self-reflections *emerge* as a result of RL training. We carefully investigated and showed the opposite. 🧵
    117K
  • Zichen Liu
    @zzlccc
    Aug 22, 2025
    With just a few lines of code, Feng’s (@fengyao1909) suggested fix—applying importance sampling on the behavior policy—resolved the training instability in my case (oat). I believe the result can generalize to other RL frameworks as well. Great work, Feng!
    45K
  • Zichen Liu
    @zzlccc
    Oct 3, 2025
    6 months after our paper release, I still recall the debates on removing the length normalization term in DrGRPO. And people gradually think DrGRPO is just about removing the std, ignoring the most important and subtle (length) bias we tried to point out to the community. Even
    41K
  • Zichen Liu
    @zzlccc
    Jul 27, 2025
    Learning GSPO proposed by Qwen team: fig 1. they propose to use sequence likelihood for importance sampling fig 2. but from the RL course by @svlevine, this is the original form of off-policy PG fig 3. per-token IS in (Dr) GRPO is an approximation of it Am I missing anything?
    63K
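For readers without the figures, the relationship the post describes can be written out as follows. The notation is an assumption of this sketch and may differ from the figures: $q$ is the prompt, $o$ the sampled response, $\pi_\theta$ the current policy and $\pi_{\text{old}}$ the behavior policy. By the chain rule, the sequence-level importance ratio is the product of the per-token ratios:

```latex
\[
\frac{\pi_\theta(o \mid q)}{\pi_{\text{old}}(o \mid q)}
  = \prod_{t=1}^{|o|} w_t,
\qquad
w_t = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\text{old}}(o_t \mid q, o_{<t})}.
\]
```

A per-token scheme weights the gradient at token $t$ by $w_t$ alone rather than by the full product, which is the sense in which the post calls it an approximation of the sequence-level form.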
  • Zichen Liu
    @zzlccc
    Mar 22, 2025
    Good catch! But in fact this correction is unnecessary. We were aware of this. The N/(N-1) factor affects all training instances equally, thus can be compensated by adapting the learning rate. Their gradients are the same after compensation. We have acknowledged the connection
    leloy!
    @leloykun
    Mar 22, 2025
    I'm not sure if someone has already pointed this out, but Dr. GRPO still has a bias that is more pronounced the smaller the group size is. To make it unbiased, simply multiply Dr. GRPO's A_i by the correction term N/(N-1). With this, you'll get LOOP (Leave-One-Out Proximal Policy
    45K
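The exchange above rests on a simple identity, sketched here under the assumption that the advantage in question is the reward minus the group mean. A leave-one-out baseline (the mean of the other N-1 rewards) gives exactly N/(N-1) times that advantage, so the two differ by a constant factor across the group. The function names are hypothetical and the rewards are made-up illustration values.

```python
def mean_baseline_advantages(rewards):
    """Advantage as reward minus the mean of the whole group."""
    n = len(rewards)
    mean = sum(rewards) / n
    return [r - mean for r in rewards]


def leave_one_out_advantages(rewards):
    """Advantage as reward minus the mean of the other N-1 rewards."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0, 1.0]  # illustrative binary rewards, N = 5
n = len(rewards)
scale = n / (n - 1)

for a, loo in zip(mean_baseline_advantages(rewards),
                  leave_one_out_advantages(rewards)):
    # The leave-one-out advantage equals N/(N-1) times the mean-baseline one.
    print(loo, scale * a)
```

Because the factor depends only on N, a fixed group size makes it a uniform rescaling of the gradient, which is the reply's argument for absorbing it into the learning rate.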
  • Zichen Liu
    @zzlccc
    Oct 6, 2025
    GEM❤️Tinker GEM, an environment suite with a unified interface, works perfectly with Tinker, the API by @thinkymachines that handles the heavy lifting of distributed training. In our latest release of GEM, we 1. supported Tinker and 5 more RL training frameworks 2. reproduced
    58K
  • Zichen Liu
    @zzlccc
    Aug 1, 2025
    In the era of experience, we're training LLM agents with RL — but something's missing... We miss the good old Gym! So we built 💎GEM: a suite of environments for training LLM 𝚐𝚎𝚗𝚎𝚛𝚊𝚕𝚒𝚜𝚝𝚜. Let’s build the Gym for LLMs, together: axon-rl.notion.site/gem
    45K
  • Zichen Liu
    @zzlccc
    Mar 26, 2025
    Since the release of Dr. GRPO, many are interested in the 𝐥𝐞𝐧𝐠𝐭𝐡 𝐛𝐢𝐚𝐬 in GRPO's formulation & implementation, as well as in PPO's implementations. I did some updates on our paper and prepared a table for better comparison (details in thread):
    22K
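The length bias discussed in several posts above can be illustrated with a deliberately simplified sketch. It keeps only the normalization: with a 1/|o| term, each token of a response carries weight advantage / length, whereas dividing by a constant gives every token the same weight regardless of response length. Clipping and importance ratios are ignored, and the function names and the constant `max_len` are assumptions made for illustration, not the paper's or Oat's code.

```python
def per_response_mean_weights(advantage, length):
    """Per-token weight when the loss is averaged over the response's own length (1/|o|)."""
    return [advantage / length] * length


def constant_norm_weights(advantage, length, max_len):
    """Per-token weight when the loss is divided by a constant (hypothetical max_len)."""
    return [advantage / max_len] * length


# Two incorrect responses with the same negative advantage but different lengths.
short_len, long_len, adv = 10, 100, -1.0

# With 1/|o|, the long wrong answer is penalized 10x less per token than the short one.
print(per_response_mean_weights(adv, short_len)[0],
      per_response_mean_weights(adv, long_len)[0])

# With a constant divisor, every token gets the same penalty.
print(constant_norm_weights(adv, short_len, max_len=100)[0],
      constant_norm_weights(adv, long_len, max_len=100)[0])
```

Under the 1/|o| weighting, lengthening an incorrect response dilutes its per-token penalty, which is one way to read the claim that ever-increasing output length may stem from the normalization rather than from learning.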
