Hi, Community,
I really need some help with this topic to figure out whether my hardware is faulty or where the bottleneck in my system is.
After the initial setup and finishing all NVIDIA updates, I installed the open-webui:ollama container as explained. Then I ran this command to get a first benchmark result:
docker exec -it open-webui ollama run deepseek-coder:33b-instruct "Write a python function that calculates the factorial of a number." --verbose
I assume Ollama is not optimized for DGX Spark yet.
I would recommend replacing it with a current llama.cpp build/container, which is very fast and has already been optimized for Spark with NVIDIA's support.
See the discussion on GitHub:
I wouldn’t use deepseek-coder anymore. Try Qwen/Qwen3-Coder-30B-A3B-Instruct or openai/gpt-oss-20b or -120b instead. Those are MoE models which are very fast even on a Spark.
Don’t use Ollama - llama.cpp is faster and gives you more flexibility in loading and configuring models.
Having said that, 12-13 t/s is about what I'd expect from a 33B dense model on a Spark.
Deepseek-coder is very old; there are better and faster models available. For Spark you want to use MoE models, as they have very few active parameters, so they can be fast.
Spark has memory bandwidth of 273 GB/s. It’s about 2x slower than desktop RTX 5070, about 3x slower than desktop 4090 and about 6x slower than desktop 5090. Token generation (inference) performance depends mostly on memory bandwidth, prefill (prompt processing) depends more on GPU performance. DGX Spark has GPU performance comparable to desktop 5070 (or slightly less).
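As a rough sanity check (assuming a ~4-5 bit quant of the 33B model, i.e. on the order of 19 GB of weights that all have to be read for every generated token):

273 GB/s ÷ ~19 GB read per token ≈ 14 tokens/s upper bound

which lines up with the 12-13 t/s observed above.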
Depending on the port you choose for llama.cpp (I think 8000 is the default), you just need to add that port to the URL. When Open WebUI is also running inside a container, you will need to replace localhost with the IP address of the Spark's primary network interface.
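As a quick sketch (assuming llama-server listens on port 8000 as above and the Spark's primary interface has the address 192.168.1.50 - both are placeholders, adjust to your setup), the OpenAI-compatible base URL you enter in Open WebUI's connection settings would be

http://192.168.1.50:8000/v1

and you can verify the server is reachable from another host or container with

curl http://192.168.1.50:8000/v1/models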
By “catastrophic” I mean: ~50 seconds of wall-clock time for each relatively short completion, in a simple batch setup.
Concretely:
Model: gpt-oss:20b running under Ollama
Endpoint: POST /api/generate with stream: false, max_tokens: 300
Usage pattern:
Python script looping over rows of an Excel file
For each row, it:
Builds a short prompt (1–2 sentences + instructions)
Calls the Ollama HTTP endpoint
Parses the response and writes the result back to another Excel file
Hardware / runtime: single Ollama container, GPU utilization around 89% during generation
Observed performance: about 50 seconds per answer from request to full response for this 20B model
For my use case (offline generation over a lot of rows), that latency per call is what I’m calling catastrophic: it makes the batch effectively unusable at scale, even though the GPU appears to be well loaded.
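For reference, each iteration of the loop boils down to one blocking HTTP call, roughly like this sketch (prompt shortened; I'm assuming the default Ollama port 11434, and in the native /api/generate API the token limit is passed as options.num_predict):

curl -s http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "<1-2 sentences from the Excel row + instructions>",
  "stream": false,
  "options": { "num_predict": 300 }
}'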
So my questions are:
Is this roughly what I should expect from gpt-oss:20b under Ollama?
Would running the same model via llama.cpp directly usually give noticeably better latency/throughput on the same hardware?
Or is there something obvious in this pattern (single-request loop, stream=false, etc.) that I should change to get reasonable performance?
I’m trying to run llama.cpp in Docker, but I’m confused about which image I should use for my setup.
I’m on an NVIDIA Spark instance, but the host architecture is ARM64:
Host platform: linux/arm64/v8
When I try to run the CUDA-enabled image:
ghcr.io/ggml-org/llama.cpp:server-cuda
Docker shows this warning:
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8)
The container still starts, but obviously CUDA is not usable, since amd64 → arm64 emulation cannot provide GPU acceleration. --gpus all is basically ignored inside the container.
🔍 What I’m trying to understand:
✔ Is there a CUDA-enabled llama.cpp Docker image that supports ARM64 + NVIDIA GPU?
✔ Or is this combination simply not supported at the moment?
✔ If GPU acceleration is possible on this hardware, which image should I use?
✔ If not, does that mean I must run the CPU-only image instead:
ghcr.io/ggml-org/llama.cpp:server
📌 System information:
(I can provide more if needed)
uname -m
→ arm64
docker run --gpus all ...
→ GPU is detected by Docker, but not usable inside the container
nvidia-smi
→ works on the host
🎯 Goal:
I want to run llama.cpp with GPU acceleration on my NVIDIA Spark machine — or at least confirm whether it is impossible with current images and architecture.
Any guidance (alternative images, custom builds, workarounds) would be greatly appreciated. Thanks!
There seems to be an open issue with the arm64 images.
So there are currently no arm64 images. When testing llama.cpp on Spark, I'm building it myself and running it without a container.
It is a build-from-source tutorial, but it should hopefully not be too bad. Check the extra note I added further down about adding curl support as well, which will allow it to grab models straight from Hugging Face.
Well, quick and dirty as I don’t have much time right now:
Install the essential build packages needed for building llama.cpp.
Create a working directory where you want to place the repository. I use src in my home dir for example.
Go into your src directory and clone the llama.cpp repo. Go into the newly created llama.cpp directory (repo).
Then build the binaries with the cmake commands mentioned in the second link. After a successful compilation, the llama-server binary will be located in ~/src/llama.cpp/build-cuda/bin.
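Roughly, the whole sequence looks like this (a sketch from memory - package names and cmake flags may differ slightly from the linked guides; -DLLAMA_CURL=ON covers the curl support mentioned earlier):

sudo apt install -y build-essential cmake git libcurl4-openssl-dev
mkdir -p ~/src && cd ~/src
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build-cuda -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build-cuda --config Release -j
ls build-cuda/bin/llama-server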
You don't need to manually download GGUFs; just use the -hf parameter and specify the Hugging Face model name, and it will download the model automatically into ~/.cache/llama.cpp and run it. Another advantage of this approach is that it will automatically pick up the multimodal (MM) projector for vision models.
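For example (the model repo here is just an illustration - substitute whatever GGUF repo you want, and pick whatever port you configured in Open WebUI):

~/src/llama.cpp/build-cuda/bin/llama-server -hf ggml-org/gpt-oss-20b-GGUF --port 8000

The first run downloads the GGUF into ~/.cache/llama.cpp; later runs reuse the cached file.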
I'd recommend going with llama.cpp, but even for Ollama, and even if it's not using any GPU acceleration, your numbers are way off - it looks like it is reloading the model every time you send a request. Once the model is loaded, response times with this model will be in the subsecond range for short prompts, and generation will be pretty fast too. I don't use this model; I use gpt-oss-120b instead, and even that gets 60 t/s inference on Spark in llama.cpp.
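A quick way to verify (assuming the bundled container from the first post; adjust the host/port to however your script reaches Ollama): run ollama ps while the batch job is going - the model should stay listed between requests. If it keeps unloading, you can pin it in memory with the keep_alive option, which accepts -1 for "keep loaded indefinitely":

docker exec -it open-webui ollama ps
curl http://localhost:11434/api/generate -d '{ "model": "gpt-oss:20b", "keep_alive": -1 }'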