"/home/nvidia/.cache/uv/builds-v0/.tmpjXZPq7/lib/python3.12/site-packages/setuptools/_distutils/command/build_ext.py",
line 368, in run
self.build_extensions()
File "<string>", line 232, in build_extensions
File "<string>", line 210, in configure
File "/usr/lib/python3.12/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake',
'/home/nvidia/.cache/uv/sdists-v9/pypi/vllm/0.10.1.1/itULH14ewSqkQjUU-v7LE/src',
'-G', 'Ninja', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DVLLM_TARGET_DEVICE=cuda',
'-DVLLM_PYTHON_EXECUTABLE=/home/nvidia/.cache/uv/builds-v0/.tmpjXZPq7/bin/python',
'-DVLLM_PYTHON_PATH=/usr/lib/python312.zip:/usr/lib/python3.12:/usr/lib/python3.12/lib-dynload:/home/nvidia/.cache/uv/builds-v0/.tmpjXZPq7/lib/python3.12/site-packages:/home/nvidia/.cache/uv/builds-v0/.tmpjXZPq7/lib/python3.12/site-packages/setuptools/_vendor',
'-DFETCHCONTENT_BASE_DIR=/home/nvidia/.cache/uv/sdists-v9/pypi/vllm/0.10.1.1/itULH14ewSqkQjUU-v7LE/src/.deps',
'-DNVCC_THREADS=1', '-DCMAKE_JOB_POOL_COMPILE:STRING=compile',
'-DCMAKE_JOB_POOLS:STRING=compile=14',
'-DCMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc']' returned non-zero exit
status 1.
hint: This usually indicates a problem with the package or the build environment.
Since the source build fails, use the prebuilt Triton vLLM container instead:
docker run --ipc=host --net host --gpus all --runtime=nvidia --privileged -it --rm -u 0:0 --name=testvllm -v "$LOCAL_NIM_CACHE:/root/.cache" nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3
(You should point LOCAL_NIM_CACHE at the .cache directory in your home directory, unless you prefer somewhere else. Mounting it keeps the LLMs you download from disappearing when you delete the container.)
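For example, assuming you want to reuse ~/.cache on the host, set this before the docker run:
export LOCAL_NIM_CACHE=~/.cache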
Once in the container, do the following:
log in to HuggingFace with your own token (free to get)
hf auth login
then download a model – I haven’t been able to make medium-to-large models fit even though there would seem to be enough memory; here’s a small one that works
you may also need to empty the memory cache before running the model (one way to do that is shown after the download command)
hf download gghfez/gemma-3-4b-novision
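Assuming “empty the memory cache” means dropping the Linux page cache (which on the Thor shares the unified memory pool with the GPU), the usual way to do it on the host is:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'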
then run the vLLM server, picking your port and allowing connections from other machines
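Something along these lines should work (--host 0.0.0.0 accepts connections from other machines; the port here matches the curl commands below):
vllm serve gghfez/gemma-3-4b-novision --host 0.0.0.0 --port 1234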
From outside the container, you can check that the server is up:
curl http://localhost:1234/v1/models
curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d '{
  "messages": [
    {
      "role": "system",
      "content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
    },
    {
      "role": "user",
      "content": "Write a limerick about Python exceptions"
    }
  ]
}'
the model takes a while to load and used up a lot of memory (either the model or the server did); I’m sure some parameters could be added to tune this
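As an illustration rather than a recommendation (the values below are just guesses), two vLLM flags commonly used for this are --gpu-memory-utilization, which caps the fraction of GPU memory vLLM reserves up front, and --max-model-len, which shrinks the KV cache by limiting the context length:
vllm serve gghfez/gemma-3-4b-novision --host 0.0.0.0 --port 1234 --gpu-memory-utilization 0.7 --max-model-len 8192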
[added after more experimentation:]
I really don’t know what I’m doing with vLLM, but I was able to reduce memory usage and improve performance with a few tweaks. I’ve had much better luck on the Thor with llama.cpp, which I built in NVIDIA’s latest PyTorch container.
FWIW you might want to fire up the server with this command and parameters. I’m sure others could greatly improve it.