
feat: GRPO + SFT Dtensor support for multimodal training #712

Merged
terrykong merged 32 commits into main from rohit/sft_vlm
Aug 19, 2025

Conversation

@rohitrango
Contributor

@rohitrango rohitrango commented Jul 22, 2025

What does this PR do ?

Adds image / video VLM support for supervised finetuning and GRPO using dtensor policy. Solves #85

Tested models:

  • Qwen2VL / Qwen2.5VL
  • Llava 1.5 / Llava Next / Llava Next Video / Llava OneVision
  • Huggingface SmolVLM2-2.2B-Instruct
  • Gemma3 4B

Tested datasets:

  • Geometry3k
  • CLEVR
  • RefCOCO

🔪 Sharp Edges

Although training runs converge, the logprob error between the vllm and HF models is consistently above 1.05. Issue tracked in #793.

Edit: this affects only Gemma3. The logprob issue is fixed for Llava, SmolVLM, and Qwen2/2.5VL.
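For context, a logprob "error" above 1.0 is usually a multiplicative probability ratio derived from the absolute difference of per-token logprobs. The sketch below illustrates that convention with plain Python lists; the function name and the exact metric are assumptions for illustration, not code from this PR:

```python
import math

def logprob_ratio_error(policy_logprobs, vllm_logprobs):
    """Worst-case multiplicative probability ratio between two logprob lists.

    A value of 1.0 means the two models agree exactly on every token;
    the PR reports values consistently above 1.05 for Gemma3.
    """
    assert len(policy_logprobs) == len(vllm_logprobs)
    # exp(|lp_a - lp_b|) is the factor by which the token probabilities differ.
    return max(math.exp(abs(a - b)) for a, b in zip(policy_logprobs, vllm_logprobs))

# Identical logprobs give a ratio of exactly 1.0.
print(logprob_ratio_error([-1.0, -2.0], [-1.0, -2.0]))  # → 1.0
```

In a real comparison the two lists would come from scoring the same sampled tokens under the training policy and under vllm.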

Usage

```shell
uv run examples/run_sft.py --config examples/configs/sft_clevr.yaml cluster.gpus_per_node=4
uv run examples/run_vlm_grpo.py cluster.gpus_per_node=4
```

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

@rohitrango rohitrango changed the base branch from rohit/vlm_grpo to main July 29, 2025 01:49
@rohitrango rohitrango changed the title from "feat: SFT support for multimodal training (VLM)" to "feat: GRPO + SFT support for multimodal training" Jul 29, 2025
@rohitrango rohitrango marked this pull request as ready for review July 29, 2025 18:30
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: cc2986f (PR #712 from rohit/sft_vlm)

❌ Submodules that need attention:

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/terrykong/Megatron-LM/commits/2ff0f099ffc30ffd152e3e29e921a1609d00855c/
CURRENT (PR #712 from rohit/sft_vlm): https://github.com/terrykong/Megatron-LM/commits/ed5c792f2a8ffe357c871f4547a8fe905a09b835/

NeMo: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/NeMo/commits/33259f2540af6eef375d43fc48bdcbd7ec490c29/
CURRENT (PR #712 from rohit/sft_vlm): https://github.com/NVIDIA/NeMo/commits/0e0894300e09aca042bc07859f660f22858f0a9f/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 919a7ce (PR #712 from rohit/sft_vlm). Same Megatron-LM and NeMo submodule mismatches as the check above.

@terrykong terrykong changed the title from "feat: GRPO + SFT support for multimodal training" to "feat: GRPO + SFT Dtensor support for multimodal training" Jul 29, 2025
@terrykong
Collaborator

terrykong commented Jul 29, 2025

copying over the last message from @rohitrango from #655

re: Remaining blockers:

  1. understanding the logprob error: I want to chalk this up to how vllm loads multimodal image embeddings in its image processor. For LLM-only models, I noted that vllm takes the same list of token_ids (a list of int values) that the policy consumes (i.e. going through the same text embedding layer, etc.). However, for multimodal inputs, vllm processes the images internally. There could also be differences in how sampling is done. I found the following excerpt from the vllm docs https://docs.vllm.ai/en/v0.9.1/usage/v1_guide.html#feature-model

> Logprobs in V1 are now returned immediately once computed from the model’s raw output (i.e. before applying any logits post-processing such as temperature scaling or penalty adjustments). As a result, the returned logprobs do not reflect the final adjusted probabilities used during sampling.
> Support for logprobs with post-sampling adjustments is in progress and will be added in future updates.

I prefer handling this issue in a separate PR (and merging initial support first) for four reasons:

  • this discrepancy is isolated to multimodal models only, so a "fix" can be shipped independently
  • multiple VLMs converge on three different datasets despite the apparent discrepancy. It is equivalent to training GRPO with a slightly off-policy model, but it does not seem to be very unstable or destructive to the learning process
  • other PRs break multimodal support regularly (every 2-3 days) and I have to rollback / fix those changes in my PR to make my scripts work. Merging this PR or at least the test cases will prevent other PRs from breaking multimodal support
  • the PR has gotten very big as it is, and adding more fixes will add additional overhead to the review process
  2. The PR has now migrated (again) to feat: GRPO + SFT Dtensor support for multimodal training #712, and is tested on 4 families of multimodal models and 3 datasets. This rolls back the passing of the vlm_kwargs list throughout the training process and instead proposes a PackedGenericDataItem to handle non-sequence data items (most of which are multimodal tensors). The single implementation works for multiple multimodal models without any additional modifications to the config.
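The pack/unpack idea behind a PackedGenericDataItem-style container can be sketched as follows. The PR's actual implementation operates on torch tensors and its interface is not shown here; this toy version uses plain lists, and the class and method names are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class PackedItem:
    """Toy stand-in for a PackedGenericDataItem-style container.

    Non-sequence payloads of varying sizes (e.g. flattened image tensors)
    are concatenated into one flat buffer, with per-item lengths recorded
    so individual items can be recovered after collation.
    """
    flat: list     # concatenation of all items
    lengths: list  # length of each original item

    @classmethod
    def pack(cls, items):
        flat = [x for item in items for x in item]
        return cls(flat=flat, lengths=[len(item) for item in items])

    def unpack(self):
        out, start = [], 0
        for n in self.lengths:
            out.append(self.flat[start:start + n])
            start += n
        return out

packed = PackedItem.pack([[1, 2, 3], [4], [5, 6]])
print(packed.unpack())  # → [[1, 2, 3], [4], [5, 6]]
```

Packing this way lets the training loop move one flat buffer through collation and sharding without caring about per-model tensor shapes, which is consistent with the claim that one implementation covers multiple VLMs.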

@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 80d9ff5 (PR #712 from rohit/sft_vlm). Same Megatron-LM and NeMo submodule mismatches as the check above.

@terrykong
Collaborator

@rohitrango My understanding is that the logprob issue may come from input processing inside vllm not matching the processing outside it. The excerpt you shared relates to sampling, so I think it remains to be seen whether this is a bug or expected behavior.

@terrykong
Collaborator

As far as keeping up with changes from main: if you rebase and encounter conflicts, it's advised to squash your commits first, since you'll then hit each conflict once rather than once per commit in the branch that touched that area.

@rohitrango
Contributor Author

re: keeping up with changes from main, the merge commits are not as big of an issue. The bigger issue is changes that break the multimodal training loop, like adding extra parameters to the dtensor path that are only supported for LLMs, or introducing a model.lm_head somewhere (for VLMs the module is model.language_model.lm_head).

This basically means I have to debug a working GRPO/SFT training loop every 2 days after merging from main.

The multimodal test cases are expected to block all such changes.

rohitrango and others added 2 commits August 16, 2025 16:21
chtruong814
chtruong814 previously approved these changes Aug 17, 2025
rohitrango and others added 2 commits August 16, 2025 21:30
chtruong814
chtruong814 previously approved these changes Aug 17, 2025
terrykong
terrykong previously approved these changes Aug 18, 2025
Signed-off-by: Yi-Fu Wu <[email protected]>
yfw
yfw previously approved these changes Aug 18, 2025
yfw added 10 commits August 19, 2025 00:28
Signed-off-by: Yi-Fu Wu <[email protected]>
This reverts commit 60b6e82.

Signed-off-by: Yi-Fu Wu <[email protected]>
This reverts commit 2a44965.

Signed-off-by: Yi-Fu Wu <[email protected]>
Signed-off-by: Yi-Fu Wu <[email protected]>
Signed-off-by: Yi-Fu Wu <[email protected]>
Signed-off-by: Yi-Fu Wu <[email protected]>
Signed-off-by: Yi-Fu Wu <[email protected]>
Signed-off-by: Yi-Fu Wu <[email protected]>

Labels

CI:L1 — Run doctests, unit tests, and functional tests; CI — Relating to CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add multimodal support (Image + Text VLM) to the Huggingface FSDP path + vllm

7 participants