Building llama.cpp container images for Spark/GB10

Hi!

For those who like to run $THINGS in containers: I tried to find a way to build a Docker image for llama.cpp, as there are currently no images at all for arm64 - only for amd64. [1]

[1] See the open issue for details: [Tracker] Docker build fails on CI for arm64 · Issue #11888 · ggml-org/llama.cpp · GitHub

So I tried to find out what the normal build process looks like, why it fails for arm64, and how to get it running on our GB10s.

The standard Dockerfiles are located in the .devops folder of the official repo. There is also one for CUDA, but that fails for GB10 out of the box. The main reason is a wrong LD_LIBRARY_PATH, so you will get an error like:

#14 92.07 [ 62%] Building CXX object common/CMakeFiles/common.dir/peg-parser.cpp.o
#14 92.23 [ 62%] Building CXX object common/CMakeFiles/common.dir/regex-partial.cpp.o
#14 92.33 [ 62%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
#14 92.38 [ 62%] Linking CXX executable ../../bin/llama-simple
#14 92.44 /usr/bin/ld: warning: libcuda.so.1, needed by ../../bin/libggml-cuda.so.0.9.4, not found (try using -rpath or -rpath-link)
#14 92.45 [ 62%] Building CXX object common/CMakeFiles/common.dir/speculative.cpp.o

In the devel container used by llama.cpp (nvidia/cuda:13.0.2-devel-ubuntu24.04) the LD_LIBRARY_PATH points to:

root@3e1024dcf4f9:$ env|grep LIBRARY
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64

The needed library is found in /usr/local/cuda-13/compat, so you need to adjust the ENV for that container.

So just add

ENV LD_LIBRARY_PATH=/usr/local/cuda-13/compat

to the build stage. I also added the CUDA architecture for CMake: if it is not specified, CMake tries to build for all visible architectures (if I understood the docs correctly), but the normal docker build process does not have access to the GPU while building. You can change this with BuildKit by defining a builder with GPU support (see Container Device Interface (CDI) | Docker Docs), which is much too complicated here, but may be useful in other projects.
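
For orientation, here is a minimal sketch of what those two changes look like in the build stage. The architecture value 121 is my assumption for GB10 (compute capability 12.1) and the cmake invocation is simplified - the actual file is in the gist linked below:

FROM nvidia/cuda:13.0.2-devel-ubuntu24.04 AS build

# make the compat driver library visible while linking (fixes the libcuda.so.1 warning)
ENV LD_LIBRARY_PATH=/usr/local/cuda-13/compat

WORKDIR /app
COPY . .

# pin the CUDA architecture instead of letting cmake probe a GPU it cannot see during the build
RUN cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 && \
    cmake --build build --config Release -j$(nproc)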

The modified Dockerfile can be found here: llama.cpp Dockerfile for DGX Spark / GB10 · GitHub

So here are all the steps to build a server container image:

mkdir src
cd src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp/.devops
wget https://gist.githubusercontent.com/stelterlab/33885c600c102792acb1638ca7d2d7e9/raw/ad4e1edc488642172afa61a7ac9d29bf146c4a36/spark.Dockerfile
cd ..
docker build -f .devops/spark.Dockerfile --target server -t llama.cpp:server-spark .

Hope that saves others some time when trying to build $THINGS that are built in a similar way.

Maybe I can push that upstream, so the llama.cpp team will integrate it into their build process.

Feedback welcome.


@cosinus thanks a lot!

Here are the instructions!

1️⃣ Build a GB10-compatible llama.cpp Docker image

Goal: have a llama.cpp:server-spark image that works correctly on your NVIDIA GB10 (arm64).

# 1. Create a working directory
mkdir -p ~/src
cd ~/src

# 2. Clone the official llama.cpp repo
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# 3. Go into the .devops folder
cd .devops

# 4. Download the special Dockerfile for DGX Spark / GB10
wget "https://gist.githubusercontent.com/stelterlab/33885c600c102792acb1638ca7d2d7e9/raw/ad4e1edc488642172afa61a7ac9d29bf146c4a36/spark.Dockerfile"

# 5. Go back to repo root
cd ..

Then build the server image with CUDA + GB10 fix:

docker build \
  -f .devops/spark.Dockerfile \
  --target server \
  -t llama.cpp:server-spark .

This image:

  • Uses nvidia/cuda:13.0.2-devel-ubuntu24.04 as base

  • Fixes LD_LIBRARY_PATH to include /usr/local/cuda-13/compat

  • Builds llama-server for arm64 + GB10.
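
A quick sanity check after the build (nothing GB10-specific, just plain Docker):

docker images | grep llama.cpp

You should see the llama.cpp repository with the server-spark tag.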


2️⃣ Download your model (Mistral Small 3.2 24B GGUF)

We chose:

unsloth/Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf

On your Spark:

mkdir -p /home/user/models
cd /home/user/models

wget "https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/resolve/main/Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf" \
  -O mistral-small-3.2-24b-ud-q4_k_xl.gguf

ls -lh /home/user/models

You should see the GGUF file in that folder.


3️⃣ Run llama.cpp server container on port 3010

We then started a container using the image you built and the model you downloaded:

docker run -d \
  --name llama-spark-mistral32 \
  --gpus all \
  -p 3010:8080 \
  -v /home/user/models:/models \
  llama.cpp:server-spark \
    --host 0.0.0.0 \
    --port 8080 \
    -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf \
    --ctx-size 16384 \
    --threads -1 \
    --n-gpu-layers 99 \
    --flash-attn auto

Check it’s running:

docker ps | grep llama

You should see llama-spark-mistral32 in the Up state.
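
To double-check that the model really loaded onto the GPU, you can also follow the container logs (the exact wording of the log lines depends on the llama.cpp version):

docker logs -f llama-spark-mistral32

Look for the lines that report how many layers were offloaded to the GPU and for the final message that the server is listening on port 8080.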


4️⃣ Test the HTTP API on the Spark

Still on the Spark:

curl -s http://localhost:3010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small-3.2-24b-ud-q4_k_xl.gguf",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Say a very short sentence in English." }
    ],
    "max_tokens": 64,
    "temperature": 0.4
  }'
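
If you only care about the generated text, you can pipe the same request through jq (assuming jq is installed on the Spark):

curl -s http://localhost:3010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-small-3.2-24b-ud-q4_k_xl.gguf", "messages": [{"role": "user", "content": "Say a very short sentence in English."}], "max_tokens": 64}' \
  | jq -r '.choices[0].message.content'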

From your Mac, you then call the same endpoint using the Spark IP:

LLAMA_URL = "http://xx.xx.xx.xx:3010/v1/chat/completions"
LLAMA_MODEL = "mistral-small-3.2-24b-ud-q4_k_xl.gguf"


5️⃣ Update your prompt generation script (mascot prompts)

We replaced:

  • OLLAMA_URL → LLAMA_URL pointing to http://xx.xx.xx.xx:3010/v1/chat/completions

  • model: "gpt-oss:20b" → model: "mistral-small-3.2-24b-ud-q4_k_xl.gguf"


@cosinus now the big question! What explains the 90-95% GPU usage right from the first launch in llama.cpp?

Have a look!

Answering to myself ;)



docker run -d \
  --name llama-spark-mistral32 \
  --gpus all \
  -p 3010:8080 \
  -v /home/user/models:/models \
  llama.cpp:server-spark \
    --host 0.0.0.0 \
    --port 8080 \
    -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf \
    --ctx-size 4096 \
    --threads -1 \
    --n-gpu-layers 16 \
    --flash-attn auto

Better by limiting the GPU layers to 16…

Around 27-30% GPU use ;)

You should see this behavior on every GPU (>90% utilization). Whenever you fire a request the GPU usage goes up to near 100% for as long as your request is being processed. After it is finished it goes down to zero again - assuming a single user sending requests one by one. GPUs in use by multiple users might never go down to zero for a long(er) time. 😅

If you install nvtop (needs a patched version for Spark [1]) you can watch exactly that: GPU utilization close to 100% while a request is running, and back to idle when it is finished.

GPUs are designed for massive parallelism, meaning their thousands of cores are meant to be used all at once. That's what makes them so fast compared to CPUs: tasks are split into many smaller pieces which can be done in parallel.

[1] NVTOP with DGX Spark unified memory support
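
If you just want a first look, the stock nvtop package is enough to get started, but for proper unified-memory readings on the Spark you will want the patched build from [1]. This assumes a stock Ubuntu-based DGX OS with the universe repository enabled:

# stock package; see [1] for the Spark-patched build
sudo apt install nvtop
nvtop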

On Spark, you always want to set --n-gpu-layers or -ngl (same parameter) to a large number (999 is a good one), so ALL layers are processed by the GPU. There is no point in offloading any layers to the CPU, as the Spark uses a unified memory architecture - you will just lose performance.
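
Putting that together, here is a sketch of the earlier run command with all layers on the GPU (same container name and model as above, so remove the old container first):

docker rm -f llama-spark-mistral32

docker run -d \
  --name llama-spark-mistral32 \
  --gpus all \
  -p 3010:8080 \
  -v /home/user/models:/models \
  llama.cpp:server-spark \
    --host 0.0.0.0 \
    --port 8080 \
    -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf \
    --ctx-size 16384 \
    --threads -1 \
    --n-gpu-layers 999 \
    --flash-attn auto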