Installing vLLM on Thor failed

Following the Quickstart - vLLM guide:

uv venv --python 3.12 --seed

source .venv/bin/activate

uv pip install vllm --torch-backend=auto

the install reports this error:

  "/home/nvidia/.cache/uv/builds-v0/.tmpjXZPq7/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py",
  line 368, in run
      self.build_extensions()
    File "<string>", line 232, in build_extensions
    File "<string>", line 210, in configure
    File "/usr/lib/python3.12/subprocess.py", line 413, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake',
  '/home/nvidia/.cache/uv/sdists-v9/pypi/vllm/0.10.1.1/itULH14ewSqkQjUU-v7LE/src',
  '-G', 'Ninja', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DVLLM_TARGET_DEVICE=cuda',
  '-DVLLM_PYTHON_EXECUTABLE=/home/nvidia/.cache/uv/builds-v0/.tmpjXZPq7/bin/python',
  '-DVLLM_PYTHON_PATH=/usr/lib/python312.zip:/usr/lib/python3.12:/usr/lib/python3.12/lib-dynload:/home/nvidia/.cache/uv/builds-v0/.tmpjXZPq7/lib/python3.12/site-packages:/home/nvidia/.cache/uv/builds-v0/.tmpjXZPq7/lib/python3.12/site-packages/setuptools/_vendor',
  '-DFETCHCONTENT_BASE_DIR=/home/nvidia/.cache/uv/sdists-v9/pypi/vllm/0.10.1.1/itULH14ewSqkQjUU-v7LE/src/.deps',
  '-DNVCC_THREADS=1', '-DCMAKE_JOB_POOL_COMPILE:STRING=compile',
  '-DCMAKE_JOB_POOLS:STRING=compile=14',
  '-DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc']' returned non-zero exit
  status 1.

  hint: This usually indicates a problem with the package or the build environment.
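
For reference, the full CMake output can be captured by rerunning the install with uv's verbose flag and saving it to a file (assuming the same virtual environment is still active):

uv pip install vllm --torch-backend=auto -v 2>&1 | tee vllm-build.log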

Hi,
We will check whether vLLM is supported on AGX Thor or not. In the meantime, there are some working solutions in this related topic:

[link to a related forum topic]

You may give it a try.

Hi,

Does the vLLM container work for you?
If so, you can find the vLLM container for Thor below:

nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3
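
If it helps, you can pull the image ahead of time (assuming Docker and the NVIDIA Container Toolkit are already set up on the Thor):

docker pull nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3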

Thanks.

As the moderators suggested, you can’t just install vLLM the normal way (with pip/uv) yet.

However, if you follow their suggestion, you can run the Docker container like this:

mkdir -p ~/.cache/nim
export LOCAL_NIM_CACHE=~/.cache/nim
docker run --ipc=host --net host --gpus all --runtime=nvidia --privileged -it --rm -u 0:0 --name=testvllm -v "$LOCAL_NIM_CACHE:/root/.cache" nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3

(You should mount the .cache directory from your home directory unless you prefer somewhere else; this keeps the models you download from disappearing if you delete the container.)

Once in the container, do the following:

  • log in to HuggingFace with your own token (free to get)
  • hf auth login
    
  • then download a model – I haven’t been able to make medium or large models fit, even though there would seem to be enough memory; here’s a small one that works
  • you may need to empty the system memory cache as well before running the model (see the example after this list)
  • hf download gghfez/gemma-3-4b-novision
    
  • then run the vLLM server, picking your port and allowing connections from other machines
  • export HF_HOME=/root/.cache/huggingface
    
  • vllm serve gghfez/gemma-3-4b-novision --host 0.0.0.0 --port 1234
    
  • curl http://localhost:1234/v1/models [from outside the container]
    
  • curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d '{
    "messages": [   
        {
            "role": "system",
            "content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
        },
        {
            "role": "user",
            "content": "Write a limerick about Python exceptions"
        }
      ]
    }'
    
  • the model takes a while to load and uses a lot of memory (or the server does); I’m sure some parameters could be added to adjust this (see the tuned command further below)
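
To empty the memory cache mentioned above, the usual Linux trick from the host should work (this is my assumption, not something from the container docs; it needs sudo):

sync
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'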

[added after more experimentation:]

I really don’t know what I’m doing with vLLM, but I was able to reduce memory usage and improve performance with a few tweaks. I’ve had much better luck on the Thor with llama.cpp, which I built in NVIDIA’s latest PyTorch container.
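
(For reference, a standard CUDA build of llama.cpp inside a container looks roughly like the sketch below – treat it as an approximation rather than the exact steps I used; it assumes git, cmake, and the CUDA toolkit are available in the container:)

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j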

FWIW you might want to fire up the vLLM server with this command and these parameters. I’m sure others could greatly improve it.

vllm serve gghfez/gemma-3-4b-novision --host 0.0.0.0 --port 1234 --max-model-len 1k --gpu-memory-utilization 0.5 --max-num-batched-tokens 2048 --tensor-parallel-size 1 --enable-prefix-caching
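
While the server loads, you can watch unified-memory usage from the host with tegrastats (assuming JetPack’s tegrastats is installed, as it normally is on Jetson/Thor; the interval is in milliseconds):

sudo tegrastats --interval 2000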

Hi,

Thanks a lot for sharing the details.
We will release the vLLM container for Thor on NGC, so we recommend using it rather than other toolkits.

@liu.jialu Does the vLLM container also work on your side?

Thanks.

Well, that was pretty fast:

Announcement of updated vLLM for Thor

I am now testing it. The Llama 3.3 70B model loaded fine, but I haven’t done the benchmark tests to compare this vLLM version with the previous one.
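
When I do benchmark it, vLLM’s built-in serving benchmark is probably the easiest route – sketched below against the small model and port used earlier in this thread; the vllm bench serve subcommand and these flags are assumed from recent vLLM releases and may differ in this container:

vllm bench serve --model gghfez/gemma-3-4b-novision --host 127.0.0.1 --port 1234 --dataset-name random --num-prompts 100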

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.