
Conversation

@Jonahcb commented Oct 14, 2025

Purpose

Fix symbol issues (#26843) on SM120 builds, where grouped_mm_c3x_sm100.cu is not compiled but ENABLE_CUTLASS_MOE_SM100 is still defined by unrelated blocks.

Make ENABLE_CUTLASS_MOE_SM100 accurately reflect SM100 grouped MoE availability:

Defined only on SM100 arches (10.0a/10.1a/10.3a on CUDA 12.8–12.9, or 10.0f on CUDA ≥13.0).

Note: I may be misinterpreting the purpose of ENABLE_CUTLASS_MOE_SM100; it may not be used to mean literally SM100 only. If that is the case, however, the imprecision should still be addressed to fix the code and reduce confusion.
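
For illustration, a minimal CMake sketch of the intended gating. The arch lists come from the description above and cuda_archs_loose_intersection is the helper already used in CMakeLists.txt; the SM100_MOE_ARCHS and VLLM_GPU_FLAGS variable names below are placeholders, not the exact code in this PR:

    # Sketch only: the SM100 grouped-MoE sources and the macro that guards
    # their dispatch are gated by the same arch check, so they stay in sync.
    if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
      cuda_archs_loose_intersection(SM100_MOE_ARCHS "10.0f" "${CUDA_ARCHS}")
    else()
      cuda_archs_loose_intersection(SM100_MOE_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
    endif()
    if(SM100_MOE_ARCHS)
      # The SM100 grouped-MoE source (grouped_mm_c3x_sm100.cu) would be added
      # to the extension's source list here as well (path omitted in this
      # sketch), so SM120-only builds never define the macro without also
      # building the symbols it guards.
      list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1")
    endif()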

Test Plan

I’m unable to build for all affected architectures; it should be tested on SM90, SM100, SM110, and SM120.

Test Result

Not available.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@github-actions commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify bot added the ci/build label Oct 14, 2025

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request correctly identifies and fixes an issue where ENABLE_CUTLASS_MOE_SM100 was being defined incorrectly for non-SM100 builds. The approach of scoping the macro definition to the relevant SM100 kernel builds is sound. However, I've found a critical issue where one of the changes will lead to build failures under certain configurations. I've left a detailed comment with a suggested fix. Besides that one issue, the changes are good and address the problem described.

@Jonahcb (Author) commented Oct 15, 2025

Hi! @mgoin thoughts on this:

Note: I may be misinterpreting the purpose of ENABLE_CUTLASS_MOE_SM100; it may not be used to mean literally SM100 only. If that is the case, however, the imprecision should still be addressed to fix the code and reduce confusion.

@mgoin (Member) left a comment

I agree we should only set ENABLE_CUTLASS_MOE_SM100 if we are compiling for 10.x, maybe even more strictly. I'm not sure I understand this fix, though, as some of these cases already specialize for 10.x.

CMakeLists.txt Outdated
Comment on lines 725 to 728
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
else()
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a;12.0a;12.1a" "${CUDA_ARCHS}")

Member commented:

Should we remove 12.x from this block?

@Jonahcb (Author) commented Oct 15, 2025

@mgoin Yeah, that sounds good. I am relatively new to NVIDIA and can't find much information about this online, but is 11.x compatible with 10.x? If not, shouldn't we remove 11.x as well? Thanks!

Contributor commented:

yes, this is a mistake

@ZJY0516 (Contributor) commented Oct 15, 2025

link #26571 for visibility

@johnnynunez (Contributor) commented:

@johnnynunez - looks like this is a recent dependency issue with apache-tvm-ffi. The main branch build is also failing. From the CI job logs:

No solution found when resolving dependencies:
[2025-10-29T05:55:33Z] 1.626   ╰─▶ Because there is no version of apache-tvm-ffi==0.1.0b15 and
[2025-10-29T05:55:33Z] 1.626       flashinfer-python==0.4.1 depends on apache-tvm-ffi==0.1.0b15, we can
[2025-10-29T05:55:33Z] 1.626       conclude that flashinfer-python==0.4.1 cannot be used.
[2025-10-29T05:55:33Z] 1.626       And because you require flashinfer-python==0.4.1, we can conclude that
[2025-10-29T05:55:33Z] 1.626       your requirements are unsatisfiable.
[2025-10-29T05:55:33Z] 1.626
[2025-10-29T05:55:33Z] 1.626       hint: `apache-tvm-ffi` was requested with a pre-release marker (e.g.,
[2025-10-29T05:55:33Z] 1.626       apache-tvm-ffi==0.1.0b15), but pre-releases weren't enabled (try:
[2025-10-29T05:55:33Z] 1.626       `--prerelease=allow`)

Hmm, maybe he has to merge main into his PR, since apache-tvm-ffi now has a stable release.

@eugr commented Nov 3, 2025

Hmm, maybe he has to merge main into his PR, since apache-tvm-ffi now has a stable release.

This also happens on main, because the dependency on flashinfer==0.4.1 triggers this apache-tvm-ffi dependency. There is a new version, flashinfer==0.5.0; I wonder if they should use that instead (assuming no breaking changes).

@johnnynunez (Contributor) commented:

Hmm, maybe he has to merge main into his PR, since apache-tvm-ffi now has a stable release.

This also happens on main, because the dependency on flashinfer==0.4.1 triggers this apache-tvm-ffi dependency. There is a new version, flashinfer==0.5.0; I wonder if they should use that instead (assuming no breaking changes).

There is an open PR. Be patient.

@Azure-Tang commented:

On Spark, you need two things:

  1. Apply the changes from this PR.
  2. Set environment variables before building vLLM (this is also essential; otherwise cmake will not pick up the correct architecture):

export TORCH_CUDA_ARCH_LIST=12.0f
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

I tried exporting these two env vars and then compiling, but I still encounter an error.

NotImplementedError: No compiled get_cutlass_moe_mm_data: no cutlass_scaled_mm kernel for CUDA device capability: 121. Required capability: 90 or 100

Does this mean we cannot run MoE models on Spark right now? @eugr

@eugr commented Nov 4, 2025

I tried exporting these two env vars and then compiling

Have you applied the changes from this PR to CMakeLists.txt?

@Azure-Tang commented Nov 4, 2025

I tried exporting these two env vars and then compiling

Have you applied the changes from this PR to CMakeLists.txt?

Sure. And I successfully ran qwen3-32b, but MoE models keep failing.

@johnnynunez (Contributor) commented:

I tried exporting these two env vars and then compiling

Have you applied the changes from this PR to CMakeLists.txt?

Sure. And I successfully ran qwen3-30b-a3b, but MoE models keep failing.

Did you try the new PyTorch 2.9.1 and Triton 3.5.1? That fixes some issues for Spark related to Triton kernels.

@Azure-Tang commented Nov 4, 2025

Did you try the new PyTorch 2.9.1 and Triton 3.5.1? That fixes some issues for Spark related to Triton kernels.

I used PyTorch 2.9.0. But I don't think that's the case, since cutlass_scaled_fp4_mm is a CUDA kernel. And it seems we don't get something like "cutlass_scaled_fp4_mm_sm120f" or "cutlass_scaled_fp4_mm_sm121" in vLLM, which would relate to Spark's arch.

[image]

@johnnynunez (Contributor) commented:

Did you try the new PyTorch 2.9.1 and Triton 3.5.1? That fixes some issues for Spark related to Triton kernels.

I used PyTorch 2.9.0. But I don't think that's the case, since cutlass_scaled_fp4_mm is a CUDA kernel. And it seems we don't get something like "cutlass_scaled_fp4_mm_sm120f" or "cutlass_scaled_fp4_mm_sm121" in vLLM, which would relate to Spark's arch.

[image]

Yes, there is a lack of support for Spark right now; we are working on it.

@swtb3-ryder commented:

Did you try the new PyTorch 2.9.1 and Triton 3.5.1? That fixes some issues for Spark related to Triton kernels.

I used PyTorch 2.9.0. But I don't think that's the case, since cutlass_scaled_fp4_mm is a CUDA kernel. And it seems we don't get something like "cutlass_scaled_fp4_mm_sm120f" or "cutlass_scaled_fp4_mm_sm121" in vLLM, which would relate to Spark's arch.
[image]

Yes, there is a lack of support for Spark right now; we are working on it.

@johnnynunez I'm interested in following developments with Spark support; where can I do that?

@trias702 commented Nov 7, 2025

Since there have been some Spark related questions here, I'm wondering if I can please add one:

It seems that doing python -m build --wheel fails on Spark because when it creates the isolated environment for building (based on the toml file) it pulls in the CPU-only version of torch, not the CUDA version, even if you set VLLM_TARGET_DEVICE=cuda. The only way I could get it to successfully build was doing python -m build --wheel --no-isolation because my local install of torch is the CUDA enabled one.

Does anyone here know how I can fix this, such that the isolated build environment pulls in the correct CUDA-enabled torch instead of the CPU-only variant?

@eugr commented Nov 7, 2025

it pulls in the CPU-only version of torch

Run python use_existing_torch.py before building.

@trias702 commented Nov 7, 2025

it pulls in the CPU-only version of torch

Run python use_existing_torch.py before building.

That doesn't work, though, because all python use_existing_torch.py does is erase torch from pyproject.toml. Then when you run python -m build --wheel, the isolated environment fails immediately because it doesn't download any torch at all, neither the CPU nor the CUDA one, so you get an ImportError "no module torch".

python -m build --wheel creates an isolated environment for the build, completely separate from the local environment, and it pulls everything in from the toml, basically just like how uv run works. So if you remove torch from the toml, the isolated build will fail as there's no torch at all anymore.

@eugr commented Nov 7, 2025

Any reason you can't use --no-build-isolation?

@trias702 commented Nov 7, 2025

Any reason you can't use --no-build-isolation?

I can, in which case it works just fine; however, that is only because my local Python environment already mirrors the contents of the toml file. The point of an isolated build is that it is essentially running uv run python -m build --wheel under the bonnet, and this is super useful because it means I don't have to worry about whether my local Python environment matches what's necessary to build. I could have a different local cuda or torch version, whatever. But I can still build vLLM any time by creating an isolated environment on the fly that always matches the toml. It's super useful and is the correct way to build complex packages, as there are guaranteed to be no version mismatches.

Only on the DGX Spark is it not working correctly, because the toml points to the CPU-only install of torch, and I'm wondering how we change that to the CUDA version instead. As it stands, I believe it's a legitimate bug and should affect any Unix box, not just the Spark, but I don't have another Unix box with a GPU to know for certain.

@eugr commented Nov 8, 2025

I could have a different local cuda or torch version, whatever.

Won't venv give you that? venv works well with uv, just use uv venv to create it and then activate as usual.

@trias702 commented Nov 8, 2025

I could have a different local cuda or torch version, whatever.

Won't venv give you that? venv works well with uv, just use uv venv to create it and then activate as usual.

venv is what uv uses under the bonnet, same with python -m build. Unfortunately, the toml file in vLLM isn't telling the venv system to use the CUDA variant of torch. You need to explicitly tell it to do that, like how Nemo-RL does it: https://github.com/NVIDIA-NeMo/RL/blob/2951ce37ae1e568ab5ff063997da478a85551dfa/pyproject.toml#L185

@eugr commented Nov 9, 2025

@Jonahcb - looks like you need to merge main into your PR for checks to pass...

@johnnynunez (Contributor) commented:

cc @mgoin

cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f" "${CUDA_ARCHS}")
else()
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a;12.0a;12.1a" "${CUDA_ARCHS}")
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")

Contributor commented:

This is going to break in PTX 9.0 I think. 10.1a is going to be 11.0a at that point.

Contributor commented:

Agree, 11.0f should stay.
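
For context, a hedged sketch of how this hunk might look if 11.0f is kept in the CUDA >= 13.0 branch while the 12.x entries are still dropped (illustrative only, not the exact change in this PR):

    if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
      # Keep the 11.0f family arch, per the review comments above.
      cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
    else()
      cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
    endif()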

@wrmedford (Contributor) commented:

Can I ask why sm110 is being excluded? It has all of the scaled MM ops of sm100.

@johnnynunez (Contributor) commented:

@Jonahcb could you resolve the comments? Thanks

@mgoin (Member) left a comment

Blocking due to unresolved comments. For now prioritizing #28938

@github-project-automation bot moved this from Ready to In progress in gpt-oss Issues & Enhancements Nov 18, 2025

@mergify bot commented Nov 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Jonahcb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

ci/build, needs-rebase, nvidia, ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: No status
Status: No status
Status: In progress

Development

Successfully merging this pull request may close these issues.