
Conversation

@Jonahcb commented Oct 14, 2025

Purpose

Fix symbol issues (#26843) on SM120 builds, where grouped_mm_c3x_sm100.cu is not compiled but ENABLE_CUTLASS_MOE_SM100 is still defined by unrelated blocks.

Make ENABLE_CUTLASS_MOE_SM100 accurately reflect SM100 grouped MoE availability:

Defined only on SM100 arches (10.0a/10.1a/10.3a on CUDA 12.8–12.9, or 10.0f on CUDA ≥13.0).

Note: I may be misinterpreting the purpose of ENABLE_CUTLASS_MOE_SM100; it may not be used to mean literally SM100 only. If that is the case, however, the imprecision should still be addressed to fix the code and reduce confusion.
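
For illustration, a minimal CMake sketch of the intended gating. The arch lists come from the description above and cuda_archs_loose_intersection is the helper already used in CMakeLists.txt; the SM100_MOE_ARCHS and VLLM_GPU_FLAGS variable names below are placeholders, not the exact code in this PR:

    # Sketch only: the SM100 grouped-MoE sources and the macro that guards
    # their dispatch are gated by the same arch check, so they stay in sync.
    if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
      cuda_archs_loose_intersection(SM100_MOE_ARCHS "10.0f" "${CUDA_ARCHS}")
    else()
      cuda_archs_loose_intersection(SM100_MOE_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
    endif()
    if(SM100_MOE_ARCHS)
      # The SM100 grouped-MoE source (grouped_mm_c3x_sm100.cu) would be added
      # to the extension's source list here as well (path omitted in this
      # sketch), so SM120-only builds never define the macro without also
      # building the symbols it guards.
      list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1")
    endif()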

Test Plan

I’m unable to build for all affected architectures; it should be tested on SM90, SM100, SM110, and SM120.

Test Result

Not available.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@github-actions commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify bot added the ci/build label Oct 14, 2025

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request correctly identifies and fixes an issue where ENABLE_CUTLASS_MOE_SM100 was being defined incorrectly for non-SM100 builds. The approach of scoping the macro definition to the relevant SM100 kernel builds is sound. However, I've found a critical issue where one of the changes will lead to build failures under certain configurations. I've left a detailed comment with a suggested fix. Besides that one issue, the changes are good and address the problem described.

@Jonahcb (Author) commented Oct 15, 2025

Hi! @mgoin thoughts on this:

Note: I may be misinterpreting the purpose of ENABLE_CUTLASS_MOE_SM100; it may not be used to mean literally SM100 only. If that is the case, however, the imprecision should still be addressed to fix the code and reduce confusion.

@mgoin (Member) left a comment

I agree we should only set ENABLE_CUTLASS_MOE_SM100 if we are compiling for 10.x, maybe even more strictly. I'm not sure I understand this fix, though, as some of these cases already specialize for 10.x.

CMakeLists.txt Outdated
Comment on lines 725 to 728
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
else()
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a;12.0a;12.1a" "${CUDA_ARCHS}")

Member commented:

Should we remove 12.x from this block?

@Jonahcb (Author) commented Oct 15, 2025

@mgoin Yeah, that sounds good. I am relatively new to NVIDIA and can't find much information about this online, but is 11.x compatible with 10.x? If not, shouldn't we remove 11.x as well? Thanks!

Contributor commented:

yes, this is a mistake

@ZJY0516 (Contributor) commented Oct 15, 2025

link #26571 for visibility

@johnnynunez (Contributor) commented:

@johnnynunez - looks like this is a recent dependency issue with apache-tvm-ffi. The main branch build is also failing. From the CI job logs:

No solution found when resolving dependencies:
[2025-10-29T05:55:33Z] 1.626   ╰─▶ Because there is no version of apache-tvm-ffi==0.1.0b15 and
[2025-10-29T05:55:33Z] 1.626       flashinfer-python==0.4.1 depends on apache-tvm-ffi==0.1.0b15, we can
[2025-10-29T05:55:33Z] 1.626       conclude that flashinfer-python==0.4.1 cannot be used.
[2025-10-29T05:55:33Z] 1.626       And because you require flashinfer-python==0.4.1, we can conclude that
[2025-10-29T05:55:33Z] 1.626       your requirements are unsatisfiable.
[2025-10-29T05:55:33Z] 1.626
[2025-10-29T05:55:33Z] 1.626       hint: `apache-tvm-ffi` was requested with a pre-release marker (e.g.,
[2025-10-29T05:55:33Z] 1.626       apache-tvm-ffi==0.1.0b15), but pre-releases weren't enabled (try:
[2025-10-29T05:55:33Z] 1.626       `--prerelease=allow`)

Hmm, maybe he has to merge main into his PR, since apache-tvm-ffi now has a stable release.

@eugr commented Nov 3, 2025

Hmm, maybe he has to merge main into his PR, since apache-tvm-ffi now has a stable release.

This also happens on main, because the dependency on flashinfer==0.4.1 triggers this apache-tvm-ffi dependency. There is a new version, flashinfer==0.5.0; I wonder if they should use that instead (assuming no breaking changes).

@johnnynunez (Contributor) commented:

Hmm, maybe he has to merge main into his PR, since apache-tvm-ffi now has a stable release.

This also happens on main, because the dependency on flashinfer==0.4.1 triggers this apache-tvm-ffi dependency. There is a new version, flashinfer==0.5.0; I wonder if they should use that instead (assuming no breaking changes).

There is an open PR. Be patient.

@Azure-Tang commented:

On Spark, you need two things:

  1. Apply the changes from this PR.
  2. Set environment variables before building vLLM (this is also essential; otherwise cmake will not pick up the correct architecture):

export TORCH_CUDA_ARCH_LIST=12.0f
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

I tried exporting these two env vars and then compiling, but I still encounter an error.

NotImplementedError: No compiled get_cutlass_moe_mm_data: no cutlass_scaled_mm kernel for CUDA device capability: 121. Required capability: 90 or 100

Does this mean we cannot run MoE models on Spark right now? @eugr

@eugr commented Nov 4, 2025

I tried exporting these two env vars and then compiling

Have you applied the changes from this PR to CMakeLists.txt?

@Azure-Tang commented Nov 4, 2025

I tried exporting these two env vars and then compiling

Have you applied the changes from this PR to CMakeLists.txt?

Sure. And I successfully ran qwen3-32b, but MoE models keep failing.

@johnnynunez (Contributor) commented:

I tried exporting these two env vars and then compiling

Have you applied the changes from this PR to CMakeLists.txt?

Sure. And I successfully ran qwen3-30b-a3b, but MoE models keep failing.

Did you try the new PyTorch 2.9.1 and Triton 3.5.1? That fixes some issues for Spark related to Triton kernels.

@Azure-Tang commented Nov 4, 2025

Did you try the new PyTorch 2.9.1 and Triton 3.5.1? That fixes some issues for Spark related to Triton kernels.

I used PyTorch 2.9.0. But I don't think that's the case, since cutlass_scaled_fp4_mm is a CUDA kernel. And it seems we don't get something like "cutlass_scaled_fp4_mm_sm120f" or "cutlass_scaled_fp4_mm_sm121" in vLLM, which would relate to Spark's arch.

[image]

@johnnynunez (Contributor) commented:

Did you try the new PyTorch 2.9.1 and Triton 3.5.1? That fixes some issues for Spark related to Triton kernels.

I used PyTorch 2.9.0. But I don't think that's the case, since cutlass_scaled_fp4_mm is a CUDA kernel. And it seems we don't get something like "cutlass_scaled_fp4_mm_sm120f" or "cutlass_scaled_fp4_mm_sm121" in vLLM, which would relate to Spark's arch.

[image]

Yes, there is a lack of support for Spark right now; we are working on it.

@swtb3-ryder commented:

Did you try the new PyTorch 2.9.1 and Triton 3.5.1? That fixes some issues for Spark related to Triton kernels.

I used PyTorch 2.9.0. But I don't think that's the case, since cutlass_scaled_fp4_mm is a CUDA kernel. And it seems we don't get something like "cutlass_scaled_fp4_mm_sm120f" or "cutlass_scaled_fp4_mm_sm121" in vLLM, which would relate to Spark's arch.
[image]

Yes, there is a lack of support for Spark right now; we are working on it.

@johnnynunez I'm interested in following developments with Spark support; where can I do that?

@trias702 commented Nov 7, 2025

Since there have been some Spark related questions here, I'm wondering if I can please add one:

It seems that doing python -m build --wheel fails on Spark because when it creates the isolated environment for building (based on the toml file) it pulls in the CPU-only version of torch, not the CUDA version, even if you set VLLM_TARGET_DEVICE=cuda. The only way I could get it to successfully build was doing python -m build --wheel --no-isolation because my local install of torch is the CUDA enabled one.

Does anyone here know how I can fix this, such that the isolated build environment pulls in the correct CUDA-enabled torch instead of the CPU-only variant?

@eugr commented Nov 7, 2025

it pulls in the CPU-only version of torch

Run python use_existing_torch.py before building.

@trias702 commented Nov 7, 2025

it pulls in the CPU-only version of torch

Run python use_existing_torch.py before building.

That doesn't work, though, because all python use_existing_torch.py does is erase torch from pyproject.toml. Then when you run python -m build --wheel, the isolated environment fails immediately because it doesn't download any torch at all, neither the CPU nor the CUDA one, so you get an ImportError "no module torch".

python -m build --wheel creates an isolated environment for the build, completely separate from the local environment, and it pulls everything in from the toml, basically just like how uv run works. So if you remove torch from the toml, the isolated build will fail as there's no torch at all anymore.

@eugr commented Nov 7, 2025

Any reason you can't use --no-build-isolation?

@trias702 commented Nov 7, 2025

Any reason you can't use --no-build-isolation?

I can, in which case it works just fine; however, that is only because my local Python environment already mirrors the contents of the toml file. The point of an isolated build is that it is essentially running uv run python -m build --wheel under the bonnet, and this is super useful because it means I don't have to worry about whether my local Python environment matches what's necessary to build. I could have a different local cuda or torch version, whatever. But I can still build vLLM any time by creating an isolated environment on the fly that always matches the toml. It's super useful and is the correct way to build complex packages, as there are guaranteed to be no version mismatches.

Only on the DGX Spark is it not working correctly, because the toml points to the CPU-only install of torch, and I'm wondering how we change that to the CUDA version instead. As it stands, I believe it's a legitimate bug and should affect any Unix box, not just the Spark, but I don't have another Unix box with a GPU to know for certain.

@eugr commented Nov 8, 2025

I could have a different local cuda or torch version, whatever.

Won't venv give you that? venv works well with uv, just use uv venv to create it and then activate as usual.

@trias702 commented Nov 8, 2025

I could have a different local cuda or torch version, whatever.

Won't venv give you that? venv works well with uv, just use uv venv to create it and then activate as usual.

venv is what uv uses under the bonnet, same with python -m build. Unfortunately, the toml file in vLLM isn't telling the venv system to use the CUDA variant of torch. You need to explicitly tell it to do that, like how Nemo-RL does it: https://github.com/NVIDIA-NeMo/RL/blob/2951ce37ae1e568ab5ff063997da478a85551dfa/pyproject.toml#L185

@eugr commented Nov 9, 2025

@Jonahcb - looks like you need to merge main into your PR for checks to pass...

@johnnynunez (Contributor) commented:

cc @mgoin

cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f" "${CUDA_ARCHS}")
else()
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a;12.0a;12.1a" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")

Contributor commented:

This is going to break in PTX 9.0 I think. 10.1a is going to be 11.0a at that point.

Contributor commented:

Agree, 11.0f should stay.
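
For context, a hedged sketch of how this hunk might look if 11.0f is kept in the CUDA >= 13.0 branch while the 12.x entries are still dropped (illustrative only, not the exact change in this PR):

    if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
      # Keep the 11.0f family arch, per the review comments above.
      cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
    else()
      cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
    endif()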

@wrmedford (Contributor) commented:

Can I ask why sm110 is being excluded? It has all of the scaled MM ops of sm100.

@johnnynunez (Contributor) commented:

@Jonahcb could you resolve the comments? Thanks

@mgoin (Member) left a comment

Blocking due to unresolved comments. For now prioritizing #28938

@github-project-automation bot moved this from Ready to In progress in gpt-oss Issues & Enhancements Nov 18, 2025

@mergify bot commented Nov 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Jonahcb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

ci/build, needs-rebase, nvidia, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: No status
Status: No status
Status: In progress

Development

Successfully merging this pull request may close these issues.