feat: v0 VLM support + GRPO pipeline (#655)
Conversation
assertions for non-vlm keys
needs testing on larger machine)
@terrykong @ashors1 Created a draft PR (duplicate of #521) to see if CI passes on this instead.
- separated reward functions into a separate file (and made them composable from YAML files directly)
- added RefCOCO task
- ability to freeze HuggingFace models (language and vision tower), with fine-grained freezing using regexes
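For context on the fine-grained freezing mentioned above, here is a minimal sketch of regex-based parameter freezing. The function name and the `visual.` prefix are illustrative assumptions, not the PR's actual API:

```python
import re

import torch.nn as nn

def freeze_by_regex(model: nn.Module, patterns: list[str]) -> None:
    """Freeze every parameter whose name matches any regex in `patterns`."""
    for name, param in model.named_parameters():
        if any(re.search(p, name) for p in patterns):
            param.requires_grad = False

# e.g. freeze the vision tower while leaving the language model trainable
# (the "visual." prefix is an assumption about the parameter naming):
# freeze_by_regex(model, [r"^visual\."])
```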
```python
## this will have consequences for data sharding for VLM models
## (split along the batch dim but from [start_patch:end_patch])
keys_to_concat = []
# ...
if key in keys_to_concat:
```
`keys_to_concat` is always empty here?
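To make the sharding concern in the inline comment concrete, here is a hedged sketch of patch-range slicing for a Qwen2-VL-style batch. It assumes one image per sample and a `pixel_values` tensor flattened across images to `[total_patches, hidden]`; the helper name is hypothetical:

```python
import torch

def shard_vlm_batch(batch: dict, start: int, end: int) -> dict:
    """Slice samples [start:end) from a batch whose `pixel_values` is
    flattened across images instead of carrying a batch dimension."""
    # Patches per sample: product of the (t, h, w) grid for each image.
    patches_per_sample = batch["image_grid_thw"].prod(dim=-1)  # [B]
    offsets = torch.cumsum(patches_per_sample, dim=0)
    start_patch = 0 if start == 0 else int(offsets[start - 1])
    end_patch = int(offsets[end - 1])

    shard = {}
    for key, value in batch.items():
        if key == "pixel_values":
            shard[key] = value[start_patch:end_patch]  # patch-dim slice
        else:
            shard[key] = value[start:end]  # ordinary batch-dim slice
    return shard
```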
```python
if random.random() < img_flip_prob:
    flip = True
    resized_image = resized_image.transpose(Image.FLIP_LEFT_RIGHT)
```
Is this always safe to do for this dataset? Do any captions rely on the positions of objects in the original image (e.g. "A cat sitting to the left of a dog")?
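To illustrate the concern, here is a hypothetical sketch of what a position-aware flip would have to do for a referring-expression dataset like RefCOCO. The caption check and the `[x, y, w, h]` bbox convention are assumptions for illustration, not code from the PR:

```python
from PIL import Image

def hflip_sample(image: Image.Image, bbox: list[float], caption: str):
    """Horizontally flip an image and mirror its [x, y, w, h] bbox.
    Returns None when the caption is position-dependent, since "left"
    and "right" in the text become wrong after the flip."""
    if any(word in caption.lower() for word in ("left", "right")):
        return None  # flipping would invalidate the referring expression
    x, y, w, h = bbox
    flipped = image.transpose(Image.FLIP_LEFT_RIGHT)
    return flipped, [image.width - x - w, y, w, h]
```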
ashors1 left a comment
Thanks for your work on this PR! Two quick comments:
- Could you add your test case to the nightly suite (https://github.com/NVIDIA-NeMo/RL/blob/rohit/vlm_grpo/tests/test_suites/nightly.txt)?
- Throughout the code, there are a number of different ways of getting `vlm_keys` or `vlm_kwargs` from the data. This seems slightly verbose, but perhaps I don't have a good enough understanding of the code to see why all these different methods are required. Would it be possible to streamline the process of getting the vlm keys/kwargs? If not, could we add some documentation to explain the structure of the vlm keys in the data? That might help to clarify things a bit.
```python
user_message['token_ids'] = message['input_ids'][0]
# add all keys and values to the user message, and the list of keys
user_message['vlm_keys'] = []
for key, value in message.items():
```
Are the `vlm_keys` specific to the dataset? And are they applicable to all messages of that dataset? If so, can this be configured with the dataset (i.e. in `clevr.py` and `refcoco.py`)? This seems to assume all keys except `'input_ids'` and `'attention_mask'` are `vlm_keys`, which seems less safe than being explicit about which keys are vlm keys.
Can we get the whitelist of `vlm_keys` from `processor.image_processor.model_input_names`?
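A hedged sketch of that suggestion; the example output list is an assumption about what Qwen2.5-VL's image processor reports, and the helper name is hypothetical:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
# e.g. roughly ["pixel_values", "image_grid_thw", ...] for Qwen2.5-VL
vlm_whitelist = set(processor.image_processor.model_input_names)

def get_vlm_keys(message: dict) -> list[str]:
    """Explicit whitelist instead of 'everything but input_ids/attention_mask'."""
    return [key for key in message if key in vlm_whitelist]
```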
- recycle computed vlm_kwargs for both unflattened and flattened batches
- remove potentially unsafe code for flipping images in refcoco
- rename `get_vlm_keys_from_clippedpgloss_batch` to `get_vlm_keys_from_flattened_batch` and move it to batched_data_dict.py
- add vlm grpo testcase to nightly
- improve documentation in CLEVR
I'm not entirely sure what causes the discrepancy. For the functional test cases, I had to choose higher thresholds for the checks to pass.
```python
# Add VL model imports
try:
    from transformers.models.qwen2_vl.modeling_qwen2_vl import Qwen2VLForConditionalGeneration, Qwen2_5_VLModel
```
Do we need these try/excepts? Can we instead make sure the transformers version we're using has these?
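A sketch of the reviewer's suggestion: verify the transformers release instead of guarding the import with try/except. The minimum version shown is an assumption for illustration, not a verified pin:

```python
import transformers
from packaging import version

# Assumption: the Qwen2.5-VL classes landed around transformers 4.49;
# take the exact pin from the release notes, not this sketch.
if version.parse(transformers.__version__) < version.parse("4.49.0"):
    raise ImportError("VLM support requires transformers>=4.49.0")

from transformers.models.qwen2_vl.modeling_qwen2_vl import (
    Qwen2VLForConditionalGeneration,
)
```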
One thought is that we do the preprocessing of the image before calling the model in the dtensor path, whereas vllm does preprocessing of the image internally (if I understand what is happening correctly). We may need to make sure whatever preprocessing vllm is doing matches exactly what we're doing in the dtensor path.
This will take me a while to analyse since I don't know exactly how the vllm engine processes the images internally. For the policy, the typical multimodal pipeline is to use the HuggingFace processor to tokenize text and preprocess images together. For vllm, the message log is simply reformatted into the format specified in this tutorial, and the same sequence of PIL Images is provided to the vLLM frontend.
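A hedged sketch of the two paths just described; the model name, prompt, and image path are placeholders, and the prompt would need the model's actual image placeholder tokens to run end to end:

```python
from PIL import Image
from transformers import AutoProcessor

image = Image.open("example.jpg")

# DTensor/HF path: the processor tokenizes text and preprocesses images
# explicitly, producing pixel_values (and grid metadata) up front.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
inputs = processor(text=["<prompt with image placeholder>"],
                   images=[image], return_tensors="pt")

# vLLM path: the same PIL image goes to the engine, which runs its own
# internal preprocessing (the step that must match the path above).
# from vllm import LLM
# llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
# out = llm.generate({"prompt": "<prompt>",
#                     "multi_modal_data": {"image": image}})
```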
From today's meeting, the remaining blockers on this PR:
re: Remaining blockers:
I prefer handling this issue in a separate PR (and merging initial support first).

What does this PR do?
Adds VLM support (Qwen2.5-VL) with a TP plan, DTensor policy, vLLM backend, and multi-GPU training.
Usage
Convergence
Training convergence on 2 H100 GPUs happens in about 60 iterations (the highest possible reward is 5).
Before your PR is "Ready for review"
Pre checks:
Additional Information