Building llama.cpp container images for Spark/GB10

Hi!

For those who like to run $THINGS in containers: I tried to find a way to build a Docker image for llama.cpp, as there are currently no images at all for arm64 - only for amd64. [1]

[1] See the open issue for details: [Tracker] Docker build fails on CI for arm64 · Issue #11888 · ggml-org/llama.cpp · GitHub

So I tried to find out what the normal build process looks like, why it fails for arm64, and how to get it running on our GB10s.

The standard Dockerfiles are located in the .devops folder of the official repo. There is also one for CUDA, but that fails for GB10 out of the box. The main reason is a wrong LD_LIBRARY_PATH, so you will get an error like:

#14 92.07 [ 62%] Building CXX object common/CMakeFiles/common.dir/peg-parser.cpp.o
#14 92.23 [ 62%] Building CXX object common/CMakeFiles/common.dir/regex-partial.cpp.o
#14 92.33 [ 62%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
#14 92.38 [ 62%] Linking CXX executable ../../bin/llama-simple
#14 92.44 /usr/bin/ld: warning: libcuda.so.1, needed by ../../bin/libggml-cuda.so.0.9.4, not found (try using -rpath or -rpath-link)
#14 92.45 [ 62%] Building CXX object common/CMakeFiles/common.dir/speculative.cpp.o

In the devel container used by llama.cpp (nvidia/cuda:13.0.2-devel-ubuntu24.04) the LD_LIBRARY_PATH points to:

root@3e1024dcf4f9:$ env|grep LIBRARY
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64

The needed library is found in /usr/local/cuda-13/compat, so you need to adjust the ENV for that container.

So just add

ENV LD_LIBRARY_PATH=/usr/local/cuda-13/compat

to the build stage. I also added the CUDA architecture for CMake: if it is not specified, CMake tries to build for all visible architectures (if I understood the docs correctly), but the normal docker build process does not have access to the GPU while building. You can change this with BuildKit by defining a builder with GPU support (see Container Device Interface (CDI) | Docker Docs), which is much too complicated here, but may be useful in other projects.
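
For orientation, here is a minimal sketch of what those two changes look like in the build stage. The architecture value 121 is my assumption for GB10 (compute capability 12.1) and the cmake invocation is simplified - the actual file is in the gist linked below:

FROM nvidia/cuda:13.0.2-devel-ubuntu24.04 AS build

# make the compat driver library visible while linking (fixes the libcuda.so.1 warning)
ENV LD_LIBRARY_PATH=/usr/local/cuda-13/compat

WORKDIR /app
COPY . .

# pin the CUDA architecture instead of letting cmake probe a GPU it cannot see during the build
RUN cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 && \
    cmake --build build --config Release -j$(nproc)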

The modified Dockerfile can be found here: llama.cpp Dockerfile for DGX Spark / GB10 · GitHub

So here are all the steps to build a server container image:

mkdir src
cd src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp/.devops
wget https://gist.githubusercontent.com/stelterlab/33885c600c102792acb1638ca7d2d7e9/raw/ad4e1edc488642172afa61a7ac9d29bf146c4a36/spark.Dockerfile
cd ..
docker build -f .devops/spark.Dockerfile --target server -t llama.cpp:server-spark .

Hope that saves others some time when trying to build $THINGS that are built in a similar way.

Maybe I can push that upstream, so the llama.cpp team will integrate it into their build process.

Feedback welcome.


@cosinus thanks a lot!

Here are the instructions!

1️⃣ Build a GB10-compatible llama.cpp Docker image

Goal: have a llama.cpp:server-spark image that works correctly on your NVIDIA GB10 (arm64).

# 1. Create a working directory
mkdir -p ~/src
cd ~/src

# 2. Clone the official llama.cpp repo
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# 3. Go into the .devops folder
cd .devops

# 4. Download the special Dockerfile for DGX Spark / GB10
wget "https://gist.githubusercontent.com/stelterlab/33885c600c102792acb1638ca7d2d7e9/raw/ad4e1edc488642172afa61a7ac9d29bf146c4a36/spark.Dockerfile"

# 5. Go back to repo root
cd ..

Then build the server image with CUDA + GB10 fix:

docker build \
  -f .devops/spark.Dockerfile \
  --target server \
  -t llama.cpp:server-spark .

This image:

  • Uses nvidia/cuda:13.0.2-devel-ubuntu24.04 as base

  • Fixes LD_LIBRARY_PATH to include /usr/local/cuda-13/compat

  • Builds llama-server for arm64 + GB10.
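
A quick sanity check after the build (nothing GB10-specific, just plain Docker):

docker images | grep llama.cpp

You should see the llama.cpp repository with the server-spark tag.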


2️⃣ Download your model (Mistral Small 3.2 24B GGUF)

We chose:

unsloth/Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf

On your Spark:

mkdir -p /home/user/models
cd /home/user/models

wget "https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF/resolve/main/Mistral-Small-3.2-24B-Instruct-2506-UD-Q4_K_XL.gguf" \
  -O mistral-small-3.2-24b-ud-q4_k_xl.gguf

ls -lh /home/user/models

You should see the GGUF file in that folder.


3️⃣ Run llama.cpp server container on port 3010

We then started a container using the image you built and the model you downloaded:

docker run -d \
  --name llama-spark-mistral32 \
  --gpus all \
  -p 3010:8080 \
  -v /home/user/models:/models \
  llama.cpp:server-spark \
    --host 0.0.0.0 \
    --port 8080 \
    -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf \
    --ctx-size 16384 \
    --threads -1 \
    --n-gpu-layers 99 \
    --flash-attn auto

Check it’s running:

docker ps | grep llama

You should see llama-spark-mistral32 in the Up state.
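
To double-check that the model really loaded onto the GPU, you can also follow the container logs (the exact wording of the log lines depends on the llama.cpp version):

docker logs -f llama-spark-mistral32

Look for the lines that report how many layers were offloaded to the GPU and for the final message that the server is listening on port 8080.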


4️⃣ Test the HTTP API on the Spark

Still on the Spark:

curl -s http://localhost:3010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small-3.2-24b-ud-q4_k_xl.gguf",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Say a very short sentence in English." }
    ],
    "max_tokens": 64,
    "temperature": 0.4
  }'
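
If you only care about the generated text, you can pipe the same request through jq (assuming jq is installed on the Spark):

curl -s http://localhost:3010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-small-3.2-24b-ud-q4_k_xl.gguf", "messages": [{"role": "user", "content": "Say a very short sentence in English."}], "max_tokens": 64}' \
  | jq -r '.choices[0].message.content'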

From your Mac, you then call the same endpoint using the Spark IP:

LLAMA_URL = "http://xx.xx.xx.xx:3010/v1/chat/completions"
LLAMA_MODEL = "mistral-small-3.2-24b-ud-q4_k_xl.gguf"


5️⃣ Update your prompt generation script (mascot prompts)

We replaced:

  • OLLAMA_URL → LLAMA_URL pointing to http://xx.xx.xx.xx:3010/v1/chat/completions

  • model: "gpt-oss:20b" → model: "mistral-small-3.2-24b-ud-q4_k_xl.gguf"


@cosinus now the big question! What explains the 90-95% GPU usage right from the first launch in llama.cpp?

Have a look!

Answering to myself ;)



docker run -d \
  --name llama-spark-mistral32 \
  --gpus all \
  -p 3010:8080 \
  -v /home/user/models:/models \
  llama.cpp:server-spark \
    --host 0.0.0.0 \
    --port 8080 \
    -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf \
    --ctx-size 4096 \
    --threads -1 \
    --n-gpu-layers 16 \
    --flash-attn auto

Better by limiting the GPU layers to 16…

Around 27-30% GPU use ;)

You should see this behavior on every GPU (>90% utilization). Whenever you fire a request the GPU usage goes up to near 100% for as long as your request is being processed. After it is finished it goes down to zero again - assuming a single user sending requests one by one. GPUs in use by multiple users might never go down to zero for a long(er) time. 😅

If you install nvtop (needs a patched version for Spark [1]) you can watch exactly that: GPU utilization close to 100% while a request is running, and back to idle when it is finished.

GPUs are designed for massive parallelism, meaning their thousands of cores are meant to be used all at once. That's what makes them so fast compared to CPUs: tasks are split into many smaller pieces which can be done in parallel.

[1] NVTOP with DGX Spark unified memory support
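
If you just want a first look, the stock nvtop package is enough to get started, but for proper unified-memory readings on the Spark you will want the patched build from [1]. This assumes a stock Ubuntu-based DGX OS with the universe repository enabled:

# stock package; see [1] for the Spark-patched build
sudo apt install nvtop
nvtop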

On Spark, you always want to set --n-gpu-layers or -ngl (same parameter) to a large number (999 is a good one), so ALL layers are processed by the GPU. There is no point in offloading any layers to the CPU, as the Spark uses a unified memory architecture - you will just lose performance.
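
Putting that together, here is a sketch of the earlier run command with all layers on the GPU (same container name and model as above, so remove the old container first):

docker rm -f llama-spark-mistral32

docker run -d \
  --name llama-spark-mistral32 \
  --gpus all \
  -p 3010:8080 \
  -v /home/user/models:/models \
  llama.cpp:server-spark \
    --host 0.0.0.0 \
    --port 8080 \
    -m /models/mistral-small-3.2-24b-ud-q4_k_xl.gguf \
    --ctx-size 16384 \
    --threads -1 \
    --n-gpu-layers 999 \
    --flash-attn auto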