Hi,
Thanks for your patience.
Below are the steps to run gpt-oss-20b with our new vLLM container.
(Tested with the 20b model, but the 120b model is expected to work as well.)
For the gpt-oss models you will need the WAR mentioned in this link for Harmony encoding: the Harmony library otherwise tries to download its tiktoken encoding files at startup, which can fail in restricted environments, so we download them once and point TIKTOKEN_ENCODINGS_BASE at the local copies.
(--gpus all exposes the GPUs to the container; -p 8000:8000 publishes the API port so you can query it from the host.)
$ sudo docker run --gpus all -it --rm -p 8000:8000 nvcr.io/nvidia/vllm:25.09-py3
# mkdir /etc/encodings
# wget https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -O /etc/encodings/cl100k_base.tiktoken
# wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken -O /etc/encodings/o200k_base.tiktoken
# export TIKTOKEN_ENCODINGS_BASE=/etc/encodings
# vllm serve openai/gpt-oss-20b
...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [gpu_model_runner.py:1953] Starting to load model openai/gpt-oss-20b...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [cuda.py:323] Using Triton backend on V1 engine.
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(EngineCore_0 pid=1222) INFO 10-02 05:50:40 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:01<00:03, 1.51s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:03<00:01, 1.69s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.74s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.71s/it]
(EngineCore_0 pid=1222)
(EngineCore_0 pid=1222) INFO 10-02 05:50:46 [default_loader.py:262] Loading weights took 5.27 seconds
(EngineCore_0 pid=1222) WARNING 10-02 05:50:46 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore_0 pid=1222) INFO 10-02 05:50:48 [gpu_model_runner.py:2007] Model loading took 13.7193 GiB and 9.283907 seconds
(EngineCore_0 pid=1222) INFO 10-02 05:50:52 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/ac91ec61b3/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=1222) INFO 10-02 05:50:52 [backends.py:559] Dynamo bytecode transform time: 3.40 s
(EngineCore_0 pid=1222) INFO 10-02 05:50:54 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.612 s
(EngineCore_0 pid=1222) INFO 10-02 05:50:54 [monitor.py:34] torch.compile takes 3.40 s in total
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [gpu_worker.py:276] Available KV cache memory: 94.61 GiB
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [kv_cache_utils.py:1013] GPU KV cache size: 2,066,752 tokens
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [kv_cache_utils.py:1017] Maximum concurrency for 131,072 tokens per request: 31.02x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:11<00:00, 7.41it/s]
(EngineCore_0 pid=1222) INFO 10-02 05:51:10 [gpu_model_runner.py:2708] Graph capturing finished in 12 secs, took 0.96 GiB
(EngineCore_0 pid=1222) INFO 10-02 05:51:10 [core.py:214] init engine (profile, create kv cache, warmup model) took 21.89 seconds
(APIServer pid=1150) INFO 10-02 05:51:13 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 258345
(APIServer pid=1150) INFO 10-02 05:51:13 [api_server.py:1611] Supported_tasks: ['generate']
(APIServer pid=1150) WARNING 10-02 05:51:14 [serving_responses.py:137] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=1150) INFO 10-02 05:51:15 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:8000
...
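Once the server reports it is listening on port 8000, a quick sanity check is a chat completion request against the OpenAI-compatible endpoint. Below is a minimal example; it assumes the default port and the model name as served, and can be run from another shell in the container (docker exec) or from the host if you published the port as above:

$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "openai/gpt-oss-20b",
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 64
        }'

You should get back JSON with the reply under choices[0].message.content.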
Thanks.