Guard SM100 CUTLASS MoE macro to SM100 builds #26844
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.
If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request correctly identifies and fixes an issue where ENABLE_CUTLASS_MOE_SM100 was being defined incorrectly for non-SM100 builds. The approach of scoping the macro definition to the relevant SM100 kernel builds is sound. However, I've found a critical issue where one of the changes will lead to build failures under certain configurations. I've left a detailed comment with a suggested fix. Besides that one issue, the changes are good and address the problem described.
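For context, here is a minimal sketch of the scoping pattern described above. It is not the actual vLLM CMakeLists.txt: the source path is elided, and the VLLM_EXT_SRC / VLLM_GPU_FLAGS variable names and the CUDA 12.8 threshold are assumptions based on the PR description; cuda_archs_loose_intersection is the helper already used elsewhere in this diff.
# Sketch only: define the macro inside the same guarded block that compiles
# grouped_mm_c3x_sm100.cu, so builds that never compile the kernel
# (e.g. SM120-only) never define ENABLE_CUTLASS_MOE_SM100.
cuda_archs_loose_intersection(CUTLASS_MOE_SM100_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND CUTLASS_MOE_SM100_ARCHS)
  list(APPEND VLLM_EXT_SRC "csrc/.../grouped_mm_c3x_sm100.cu")   # path elided
  list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM100=1")     # defined only where the kernel is built
endif()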
Hi! @mgoin, thoughts on this?
mgoin left a comment:
I agree we should only set ENABLE_CUTLASS_MOE_SM100 if we are compiling for 10.x, maybe even more strictly. I'm not sure I understand this fix, though, as some of these cases already specialize for 10.x.
CMakeLists.txt (outdated)
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
  cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
else()
  cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a;12.0a;12.1a" "${CUDA_ARCHS}")
Should we remove 12.x from this block?
@mgoin Yeah, that sounds good. I am relatively new to NVIDIA and can't find much information about this online, but is 11.x compatible with 10.x? If not, shouldn't we remove 11.x as well? Thanks!
yes, this is a mistake
link #26571 for visibility
Force-pushed from def02d3 to 90ea455.
Hmm, maybe he has to merge main into his PR, since apache-tvm-ffi now has a stable release.
This also happens on main, because the dependency on flashinfer==0.4.1 pulls in this apache dependency. There is a new version, flashinfer==0.5.0; I wonder if that should be used instead (assuming no breaking changes).
There is an open PR; be patient.
I tried exporting these two env vars and then compiling, but I still encounter the error. Does this mean we cannot run MoE models on Spark right now? @eugr
Have you applied the changes from this PR to CMakeLists.txt?
Sure. I can successfully run Qwen3-32B, but MoE models keep failing.
Did you try the new PyTorch 2.9.1 and Triton 3.5.1? That fixes some Spark issues related to Triton kernels.
@johnnynunez I'm interested in following developments with Spark support; where can I do that?
Since there have been some Spark-related questions here, I'm wondering if I can please add one: it seems that an isolated build pulls in the CPU-only variant of torch. Does anyone here know how I can fix this, such that the isolated build environment pulls in the correct CUDA-enabled torch instead?
Run
That doesn't work though, because all
Any reason you can't use --no-build-isolation?
I can, and in that case it works just fine, but only because my local Python environment already mirrors the contents of the toml file. The point of an isolated build is that it essentially builds in a clean environment defined by the toml file. Only on the DGX Spark it's not working correctly, because the toml is pointing to the CPU-only install for torch, and I'm wondering how we change that to the CUDA version instead. As it stands, I believe it's a legitimate bug and should affect any Unix box, not just the Spark, but I don't have another Unix box with a GPU to know for certain.
Won't venv give you that? venv works well with uv; just use uv venv to create it and then activate as usual.
venv is what uv uses under the bonnet, and the same goes for python -m build. Unfortunately, the toml file in vLLM isn't telling the isolated environment to use the CUDA variant of torch. You need to tell it explicitly, like Nemo-RL does: https://github.com/NVIDIA-NeMo/RL/blob/2951ce37ae1e568ab5ff063997da478a85551dfa/pyproject.toml#L185
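For anyone hitting the same issue, the fragment below shows roughly how a uv-managed project can pin torch to a CUDA wheel index instead of the CPU-only PyPI build. It is an illustration only, not vLLM's actual pyproject.toml: the index name and the cu129 URL are placeholders, and a plain pip-based python -m build would not read these uv-specific tables.
# Hypothetical pyproject.toml fragment (uv-specific), not vLLM's actual config.
[[tool.uv.index]]
name = "pytorch-cuda"                             # placeholder index name
url = "https://download.pytorch.org/whl/cu129"  # pick the CUDA version you need
explicit = true

[tool.uv.sources]
torch = [{ index = "pytorch-cuda" }]              # resolve torch from the CUDA index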
@Jonahcb - looks like you need to merge main into your PR for checks to pass...
cc @mgoin
  cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f" "${CUDA_ARCHS}")
else()
-  cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a;12.0a;12.1a" "${CUDA_ARCHS}")
+  cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
This is going to break in PTX 9.0 I think. 10.1a is going to be 11.0a at that point.
Agree, 11.0f should stay.
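Putting those two comments together, one possible shape for this branch (a sketch of the review feedback, not what the PR currently contains) would be:
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
  # Keep 11.0f so parts currently targeted as 10.1a (renamed to 11.0a per the
  # comment above) still get the kernel on newer toolkits.
  cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
else()
  cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
endif()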
Can I ask why sm110 is being excluded? It has all of the scaled MM ops of sm100.
@Jonahcb could you resolve the comments? Thanks |
mgoin left a comment:
Blocking due to unresolved comments. For now prioritizing #28938
This pull request has merge conflicts that must be resolved before it can be merged.


Purpose
Fix symbol issues (#26843) on SM120 builds where grouped_mm_c3x_sm100.cu is not compiled but ENABLE_CUTLASS_MOE_SM100 was defined by unrelated blocks.
Make ENABLE_CUTLASS_MOE_SM100 accurately reflect SM100 grouped MoE availability:
Defined only on SM100 arches (10.0a/10.1a/10.3a on CUDA 12.8–12.9, or 10.0f on CUDA ≥13.0).
Note: I may be misinterpreting the purpose of ENABLE_CUTLASS_MOE_SM100, as it may not be meant literally to cover only SM100. However, if that is the case, this imprecision should be addressed to fix the code and reduce confusion.
Test Plan
I’m unable to build for all affected architectures; it should be tested on SM90, SM100, SM110, and SM120.
Test Result
Not available.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.