vLLM container out of date for new models

In the “Install and Use vLLM for Inference” guide for Spark, the nvcr.io/nvidia/vllm:25.09-py3 image packages vllm v0.10.1.

Subsequent upstream releases (v0.10.2 and v0.11.0) added support for major new models such as Qwen3-Next and Qwen3-VL that I was hoping to use.

From reading online, I understand this is the very first release of an NVIDIA-specific vLLM container, so my first question is: what release cadence can we expect for integrating upstream releases?

The second question: is there any workaround for setting up arbitrary vLLM versions on Spark? I tried naively upgrading vLLM inside the Docker image, but that immediately broke CUDA compatibility and it got lost looking for libcudart.so.12.

I was similarly unsuccessful launching with the upstream vLLM project's Docker image, running into various CUDA library issues (tried with both CUDA 13 and CUDA 12).
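(For anyone trying to reproduce: a quick sketch of how to check which vLLM and CUDA runtime the container actually ships; exact versions will vary.)
python3 -c "import vllm; print(vllm.__version__)"
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
ldconfig -p | grep libcudart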

Appreciate any help or insight on future releases - thanks!

2 Likes

Run VLLM in Spark

  1. Install uv and python3
curl -LsSf https://astral.sh/uv/install.sh | sh
sudo apt install python3-dev python3.12-dev
  2. Create environment
uv venv .vllm --python 3.12
source .vllm/bin/activate
  3. Install PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
  4. Install flashinfer and triton, then build vLLM from source
uv pip install xgrammar triton flashinfer-python --prerelease=allow
git clone --recursive https://github.com/vllm-project/vllm.git
cd vllm
python3 use_existing_torch.py
uv pip install -r requirements/build.txt
uv pip install --no-build-isolation -e .
  5. Export variables
export TORCH_CUDA_ARCH_LIST=12.1a # Spark 12.1, 12.0f, 12.1a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  6. Clean memory
sudo sysctl -w vm.drop_caches=3
  7. Run gpt-oss-120b (a quick sanity check of the server follows after these steps)
mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
export TIKTOKEN_ENCODINGS_BASE=${PWD}/tiktoken_encodings
# mxfp8 activation for MoE: faster, but higher risk to accuracy.
export VLLM_USE_FLASHINFER_MXFP4_MOE=1
uv run vllm serve "openai/gpt-oss-120b" --async-scheduling --port 8000 --host 0.0.0.0 --trust_remote_code --swap-space 16 --max-model-len 32000 --tensor-parallel-size 1 --max-num-seqs 1024 --gpu-memory-utilization 0.7
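
Once the server is up, a quick sanity check against the OpenAI-compatible endpoint (a sketch; adjust host, port, and model name if you changed them):
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'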

If the Triton backend fails for you, delete the triton_kernels path and compile/install Triton from main.
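
A rough sketch of building Triton from main (the repo layout has moved between Triton versions, so check the Triton README for the current steps):
git clone https://github.com/triton-lang/triton.git
cd triton
uv pip install -e .   # older checkouts keep the package under python/, i.e. "uv pip install -e python"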

5 Likes

Nice, thank you! I ended up building a new vLLM from the main branch inside the provided container, and it works just fine, but it's nice to have the option to build it on the host system. I guess I was missing some environment variables when building on the host.

1 Like

That is surely a CCCL include-path problem:

export CPLUS_INCLUDE_PATH=/usr/local/cuda/include/cccl${CPLUS_INCLUDE_PATH:+:${CPLUS_INCLUDE_PATH}}
export C_INCLUDE_PATH=/usr/local/cuda/include/cccl${C_INCLUDE_PATH:+:${C_INCLUDE_PATH}}
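(Worth confirming the bundled CCCL headers are actually there before exporting; a quick check, assuming a default CUDA 13 install path:)
ls /usr/local/cuda/include/cccl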
1 Like

No, there is a bug in VLLM: [Bug]: Undefined symbol cutlass_moe_mm_sm100 on SM120 CUDA builds (macro enabled, grouped_mm_c3x_sm100.cu not compiled) · Issue #26843 · vllm-project/vllm · GitHub
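
A quick way to confirm you are hitting that bug is to check whether the symbol is only referenced, never defined, in the built extension (a sketch; with the editable install from the steps above, the extension ends up inside the vllm checkout):
nm -D vllm/_C.abi3.so | grep cutlass_moe_mm_sm100   # "U" means referenced but not compiled in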

1 Like

For anyone else landing here: these steps won't work without the following on the Spark:

sudo apt update

sudo apt install python3.12-dev
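
(A quick way to confirm the headers landed, as a sketch:)
ls /usr/include/python3.12/Python.h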
2 Likes

Yes, that is true.

Thanks for this; these are nice, clear instructions.

I found that I needed to include --prerelease=allow in the flashinfer install command above, following the hints; otherwise it complained about version issues with flashinfer-python and apache-tvm-ffi.

I'm curious about TORCH_CUDA_ARCH_LIST: I've previously only seen "12.1" as the option for the GB10, and I haven't stumbled on any documentation listing "12.1a". Where can I find further documentation on this one?

I've been through several iterations of trying to get a venv with all the right dependencies to run vLLM, and I keep ultimately bumping up against:
ImportError: /opt/vllm/vllm/_C.abi3.so: undefined symbol: _Z20cutlass_moe_mm_sm100RN2at6TensorERKS0_S3_S3_S3_S3_S3_S3_S3_S3_bb

which is also what I hit when I reached the end of these steps. Any additional suggestions? Or is this potentially unrelated, and I should open a separate thread?

Thank you!

Apparently, you need to specify 12.0f for CUDA 13: source.

Also, to avoid the undefined-symbol errors, you need to apply the patch from that unmerged pull request (although if you set 12.0f, I'm not sure if you still need the patch; I'm trying to compile it without it now. EDIT: nope, still needed):

cat <<'EOF' | patch -p1
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 7cb94f919..f860e533e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -594,9 +594,9 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 
   # FP4 Archs and flags
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
-    cuda_archs_loose_intersection(FP4_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
+    cuda_archs_loose_intersection(FP4_ARCHS "10.0f" "${CUDA_ARCHS}")
   else()
-    cuda_archs_loose_intersection(FP4_ARCHS "10.0a;10.1a;12.0a;12.1a" "${CUDA_ARCHS}")
+    cuda_archs_loose_intersection(FP4_ARCHS "10.0a;10.1a" "${CUDA_ARCHS}")
   endif()
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND FP4_ARCHS)
     set(SRCS
@@ -668,7 +668,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   endif()
 
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
-    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
+    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f" "${CUDA_ARCHS}")
   else()
     cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
   endif()
@@ -716,9 +716,9 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   endif()
 
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
-    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
+    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f" "${CUDA_ARCHS}")
   else()
-    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a;12.0a;12.1a" "${CUDA_ARCHS}")
+    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
   endif()
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
     set(SRCS "csrc/quantization/w8a8/cutlass/moe/blockwise_scaled_group_mm_sm100.cu")
EOF

Please note that you need to set the following environment variables BEFORE compiling vLLM, otherwise CMake won't pick up the arch:

export TORCH_CUDA_ARCH_LIST=12.0f # CUDA 13 needs 12.1f
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
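
Putting the pieces together, the rebuild order would look roughly like this (a sketch, run from the root of the vllm checkout):
export TORCH_CUDA_ARCH_LIST=12.0f
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
# apply the patch above from the repo root, then rebuild the editable install
uv pip install --no-build-isolation -e .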

There is also an existing thread on this: Run VLLM in Spark - #33 by eugr

2 Likes

Excellent, thank you @eugr !

I gave it a run and those were the small changes I needed to get vllm to build appropriately. I really appreciate your insight.

Thanks again, and happy experimenting!

1 Like
