Hi,
Thanks for your patience.
Below are the steps to run gpt-oss-20b with our new vLLM container.
(Tested with the 20b model, but the 120b model is expected to work as well.)
For the gpt-oss models you will need the WAR mentioned in this link for Harmony encoding: the Harmony library otherwise tries to download its tiktoken encoding files at startup, which can fail in restricted environments, so we download them once and point TIKTOKEN_ENCODINGS_BASE at the local copies.
(--gpus all exposes the GPUs to the container; -p 8000:8000 publishes the API port so you can query it from the host.)
$ sudo docker run --gpus all -it --rm -p 8000:8000 nvcr.io/nvidia/vllm:25.09-py3
# mkdir /etc/encodings
# wget https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken -O /etc/encodings/cl100k_base.tiktoken
# wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken -O /etc/encodings/o200k_base.tiktoken
# export TIKTOKEN_ENCODINGS_BASE=/etc/encodings
# vllm serve openai/gpt-oss-20b
...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [gpu_model_runner.py:1953] Starting to load model openai/gpt-oss-20b...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [cuda.py:323] Using Triton backend on V1 engine.
(EngineCore_0 pid=1222) INFO 10-02 05:50:38 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
(EngineCore_0 pid=1222) INFO 10-02 05:50:40 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:01<00:03, 1.51s/it]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:03<00:01, 1.69s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.74s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00, 1.71s/it]
(EngineCore_0 pid=1222)
(EngineCore_0 pid=1222) INFO 10-02 05:50:46 [default_loader.py:262] Loading weights took 5.27 seconds
(EngineCore_0 pid=1222) WARNING 10-02 05:50:46 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore_0 pid=1222) INFO 10-02 05:50:48 [gpu_model_runner.py:2007] Model loading took 13.7193 GiB and 9.283907 seconds
(EngineCore_0 pid=1222) INFO 10-02 05:50:52 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/ac91ec61b3/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=1222) INFO 10-02 05:50:52 [backends.py:559] Dynamo bytecode transform time: 3.40 s
(EngineCore_0 pid=1222) INFO 10-02 05:50:54 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.612 s
(EngineCore_0 pid=1222) INFO 10-02 05:50:54 [monitor.py:34] torch.compile takes 3.40 s in total
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [gpu_worker.py:276] Available KV cache memory: 94.61 GiB
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [kv_cache_utils.py:1013] GPU KV cache size: 2,066,752 tokens
(EngineCore_0 pid=1222) INFO 10-02 05:50:56 [kv_cache_utils.py:1017] Maximum concurrency for 131,072 tokens per request: 31.02x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:11<00:00, 7.41it/s]
(EngineCore_0 pid=1222) INFO 10-02 05:51:10 [gpu_model_runner.py:2708] Graph capturing finished in 12 secs, took 0.96 GiB
(EngineCore_0 pid=1222) INFO 10-02 05:51:10 [core.py:214] init engine (profile, create kv cache, warmup model) took 21.89 seconds
(APIServer pid=1150) INFO 10-02 05:51:13 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 258345
(APIServer pid=1150) INFO 10-02 05:51:13 [api_server.py:1611] Supported_tasks: ['generate']
(APIServer pid=1150) WARNING 10-02 05:51:14 [serving_responses.py:137] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=1150) INFO 10-02 05:51:15 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:8000
...
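Once the server reports it is listening on port 8000, a quick sanity check is a chat completion request against the OpenAI-compatible endpoint. Below is a minimal example; it assumes the default port and the model name as served, and can be run from another shell in the container (docker exec) or from the host if you published the port as above:

$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "openai/gpt-oss-20b",
          "messages": [{"role": "user", "content": "Say hello in one sentence."}],
          "max_tokens": 64
        }'

You should get back JSON with the reply under choices[0].message.content.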
Thanks.