vLLM container out of date for new models

In the “Install and Use vLLM for Inference” guide for Spark, the nvcr.io/nvidia/vllm:25.09-py3 image packages vllm v0.10.1.

Subsequent upstream releases (v0.10.2 and v0.11.0) added support for major new models such as Qwen3-Next and Qwen3-VL that I was hoping to use.

From reading online, I understand this is the very first release of an NVIDIA-specific vLLM container, so my first question is: what release cadence can we expect for integrating upstream releases?

The second question: is there any workaround for setting up arbitrary vLLM versions on Spark? I tried naively upgrading vLLM inside the Docker image, but that immediately broke CUDA compatibility and it got lost looking for libcudart.so.12.

I was similarly unsuccessful launching with the upstream vLLM project's Docker image, running into various CUDA library issues (tried with both CUDA 13 and CUDA 12).
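(For anyone trying to reproduce: a quick sketch of how to check which vLLM and CUDA runtime the container actually ships; exact versions will vary.)
python3 -c "import vllm; print(vllm.__version__)"
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
ldconfig -p | grep libcudart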

Appreciate any help or insight on future releases - thanks!

2 Likes

Run VLLM in Spark

  1. Install uv and python3
curl -LsSf https://astral.sh/uv/install.sh | sh
sudo apt install python3-dev python3.12-dev
  2. Create environment
uv venv .vllm --python 3.12
source .vllm/bin/activate
  3. Install PyTorch
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
  4. Install flashinfer and triton, then build vLLM from source
uv pip install xgrammar triton flashinfer-python --prerelease=allow
git clone --recursive https://github.com/vllm-project/vllm.git
cd vllm
python3 use_existing_torch.py
uv pip install -r requirements/build.txt
uv pip install --no-build-isolation -e .
  5. Export variables
export TORCH_CUDA_ARCH_LIST=12.1a # Spark 12.1, 12.0f, 12.1a
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
  6. Clean memory
sudo sysctl -w vm.drop_caches=3
  7. Run gpt-oss-120b (a quick sanity check of the server follows after these steps)
mkdir -p tiktoken_encodings
wget -O tiktoken_encodings/o200k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"
wget -O tiktoken_encodings/cl100k_base.tiktoken "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
export TIKTOKEN_ENCODINGS_BASE=${PWD}/tiktoken_encodings
# mxfp8 activation for MoE: faster, but higher risk to accuracy.
export VLLM_USE_FLASHINFER_MXFP4_MOE=1
uv run vllm serve "openai/gpt-oss-120b" --async-scheduling --port 8000 --host 0.0.0.0 --trust_remote_code --swap-space 16 --max-model-len 32000 --tensor-parallel-size 1 --max-num-seqs 1024 --gpu-memory-utilization 0.7
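
Once the server is up, a quick sanity check against the OpenAI-compatible endpoint (a sketch; adjust host, port, and model name if you changed them):
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "openai/gpt-oss-120b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'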

If the Triton backend fails for you, delete the triton_kernels path and compile/install Triton from main.
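
A rough sketch of building Triton from main (the repo layout has moved between Triton versions, so check the Triton README for the current steps):
git clone https://github.com/triton-lang/triton.git
cd triton
uv pip install -e .   # older checkouts keep the package under python/, i.e. "uv pip install -e python"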

5 Likes

Nice, thank you! I ended up building a new vLLM from the main branch inside the provided container, and it works just fine, but it's nice to have the option to build it on the host system. I guess I was missing some environment variables when building on the host.

1 Like

That is surely a CCCL include-path problem:

export CPLUS_INCLUDE_PATH=/usr/local/cuda/include/cccl${CPLUS_INCLUDE_PATH:+:${CPLUS_INCLUDE_PATH}}
export C_INCLUDE_PATH=/usr/local/cuda/include/cccl${C_INCLUDE_PATH:+:${C_INCLUDE_PATH}}
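(Worth confirming the bundled CCCL headers are actually there before exporting; a quick check, assuming a default CUDA 13 install path:)
ls /usr/local/cuda/include/cccl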
1 Like

No, there is a bug in VLLM: [Bug]: Undefined symbol cutlass_moe_mm_sm100 on SM120 CUDA builds (macro enabled, grouped_mm_c3x_sm100.cu not compiled) · Issue #26843 · vllm-project/vllm · GitHub
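
A quick way to confirm you are hitting that bug is to check whether the symbol is only referenced, never defined, in the built extension (a sketch; with the editable install from the steps above, the extension ends up inside the vllm checkout):
nm -D vllm/_C.abi3.so | grep cutlass_moe_mm_sm100   # "U" means referenced but not compiled in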

1 Like

For anyone else landing here: these steps won't work without the following on the Spark:

sudo apt update

sudo apt install python3.12-dev
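
(A quick way to confirm the headers landed, as a sketch:)
ls /usr/include/python3.12/Python.h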
2 Likes

Yes, that is true.

Thanks for this; these are nice, clear instructions.

I found that I needed to include --prerelease=allow in the flashinfer install command above, following the hints; otherwise it complained about version issues with flashinfer-python and apache-tvm-ffi.

I'm curious about TORCH_CUDA_ARCH_LIST: I've previously only seen "12.1" as the option for the GB10, and I haven't stumbled on any documentation listing "12.1a". Where can I find further documentation on this one?

I've been through several iterations of trying to get a venv with all the right dependencies to run vLLM, and I keep ultimately bumping up against:
ImportError: /opt/vllm/vllm/_C.abi3.so: undefined symbol: _Z20cutlass_moe_mm_sm100RN2at6TensorERKS0_S3_S3_S3_S3_S3_S3_S3_S3_bb

which is also what I hit when I reached the end of these steps. Any additional suggestions? Or is this potentially unrelated, and I should open a separate thread?

Thank you!

Apparently, you need to specify 12.0f for CUDA 13: source.

Also, to avoid the undefined-symbol errors, you need to apply the patch from that unmerged pull request (although if you set 12.0f, I'm not sure if you still need the patch; I'm trying to compile it without it now. EDIT: nope, still needed):

cat <<'EOF' | patch -p1
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 7cb94f919..f860e533e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -594,9 +594,9 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
 
   # FP4 Archs and flags
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
-    cuda_archs_loose_intersection(FP4_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
+    cuda_archs_loose_intersection(FP4_ARCHS "10.0f" "${CUDA_ARCHS}")
   else()
-    cuda_archs_loose_intersection(FP4_ARCHS "10.0a;10.1a;12.0a;12.1a" "${CUDA_ARCHS}")
+    cuda_archs_loose_intersection(FP4_ARCHS "10.0a;10.1a" "${CUDA_ARCHS}")
   endif()
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND FP4_ARCHS)
     set(SRCS
@@ -668,7 +668,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   endif()
 
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
-    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f" "${CUDA_ARCHS}")
+    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f" "${CUDA_ARCHS}")
   else()
     cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a" "${CUDA_ARCHS}")
   endif()
@@ -716,9 +716,9 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
   endif()
 
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
-    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f;11.0f;12.0f" "${CUDA_ARCHS}")
+    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0f" "${CUDA_ARCHS}")
   else()
-    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a;12.0a;12.1a" "${CUDA_ARCHS}")
+    cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
   endif()
   if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.8 AND SCALED_MM_ARCHS)
     set(SRCS "csrc/quantization/w8a8/cutlass/moe/blockwise_scaled_group_mm_sm100.cu")
EOF

Please note that you need to set the following environment variables BEFORE compiling vLLM, otherwise CMake won't pick up the arch:

export TORCH_CUDA_ARCH_LIST=12.0f # CUDA 13 needs 12.1f
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
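
Putting the pieces together, the rebuild order would look roughly like this (a sketch, run from the root of the vllm checkout):
export TORCH_CUDA_ARCH_LIST=12.0f
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
# apply the patch above from the repo root, then rebuild the editable install
uv pip install --no-build-isolation -e .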

There is also an existing thread on this: Run VLLM in Spark - #33 by eugr

2 Likes

Excellent, thank you @eugr !

I gave it a run and those were the small changes I needed to get vllm to build appropriately. I really appreciate your insight.

Thanks again, and happy experimenting!

1 Like
