Describe the bug
When running GRPO for VLMs (Qwen2.5-VL, LLaVA, etc.), the logprobs generated by vLLM and those computed by Hugging Face differ by a margin higher than 1.05. Despite this mismatch, the policy still converges across the different VLMs.
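To make the mismatch easier to inspect, here is a minimal sketch of how the gap between the two sets of per-token logprobs could be measured. It assumes the reported margin is a ratio of probabilities, i.e. the mean of exp(|logprob_vllm - logprob_hf|) over the generated tokens; the report does not state how the number is computed, so treat that as a guess. The function name `logprob_mismatch` and its inputs are hypothetical and not part of the training code.

```python
import math


def logprob_mismatch(vllm_logprobs, hf_logprobs):
    """Compare two aligned lists of per-token logprobs (hypothetical helper).

    Returns the largest absolute logprob difference and the mean
    probability ratio exp(|a - b|); a ratio of 1.0 means exact agreement.
    """
    if len(vllm_logprobs) != len(hf_logprobs):
        raise ValueError("logprob sequences must have the same length")
    if not vllm_logprobs:
        raise ValueError("logprob sequences must not be empty")

    # Per-token absolute difference in log space.
    abs_diffs = [abs(a - b) for a, b in zip(vllm_logprobs, hf_logprobs)]
    # exp(|a - b|) is the ratio between the larger and smaller probability.
    ratios = [math.exp(d) for d in abs_diffs]

    return {
        "max_abs_diff": max(abs_diffs),
        "mean_prob_ratio": sum(ratios) / len(ratios),
    }
```

Under this assumption, a `mean_prob_ratio` above 1.05 would correspond to the behavior described here, and `max_abs_diff` helps tell a uniform drift apart from a few badly mismatched tokens.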
Steps/Code to reproduce bug
Run `uv run examples/run_vlm_grpo.py` from PR #712.
Expected behavior
The logprobs generated by vLLM and by Hugging Face for the same tokens should agree, with the margin staying at or below 1.05.
Environment overview (please complete the following information)
- Environment location: [local / cluster]
- Method of install: [pip install or from source].