Hi, Community,
I really need some help with this topic to figure out whether my hardware is faulty or where the bottleneck in my system is.
After the initial setup and finishing all NVIDIA updates, I installed the open-webui:ollama container as explained. Then I ran this command to get a first benchmark result:
docker exec -it open-webui ollama run deepseek-coder:33b-instruct "Write a python function that calculates the factorial of a number." --verbose
I assume Ollama is not optimized for DGX Spark yet.
I would recommend replacing it with a current llama.cpp build/container, which is very fast and has already been optimized for Spark with NVIDIA's support.
See the discussion on GitHub:
I wouldn’t use deepseek-coder anymore. Try Qwen/Qwen3-Coder-30B-A3B-Instruct or openai/gpt-oss-20b or -120b instead. Those are MoE models which are very fast even on a Spark.
Don’t use Ollama - llama.cpp is faster and gives you more flexibility in loading and configuring models.
Having said that, 12-13 t/s is about what I'd expect from a 33B dense model on a Spark.
Deepseek-coder is very old; there are better and faster models available. For Spark you want to use MoE models, as they have very few active parameters, so they can be fast.
Spark has memory bandwidth of 273 GB/s. It’s about 2x slower than desktop RTX 5070, about 3x slower than desktop 4090 and about 6x slower than desktop 5090. Token generation (inference) performance depends mostly on memory bandwidth, prefill (prompt processing) depends more on GPU performance. DGX Spark has GPU performance comparable to desktop 5070 (or slightly less).
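As a rough sanity check (assuming a ~4-5 bit quant of the 33B model, i.e. on the order of 19 GB of weights that all have to be read for every generated token):

273 GB/s ÷ ~19 GB read per token ≈ 14 tokens/s upper bound

which lines up with the 12-13 t/s observed above.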
Depending on the port you choose for llama.cpp (I think 8000 is the default), you just need to add that port to the URL. When Open WebUI is also running inside a container, you will need to replace localhost with the IP address of the Spark's primary network interface.
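As a quick sketch (assuming llama-server listens on port 8000 as above and the Spark's primary interface has the address 192.168.1.50 - both are placeholders, adjust to your setup), the OpenAI-compatible base URL you enter in Open WebUI's connection settings would be

http://192.168.1.50:8000/v1

and you can verify the server is reachable from another host or container with

curl http://192.168.1.50:8000/v1/models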
By “catastrophic” I mean: ~50 seconds of wall-clock time for each relatively short completion, in a simple batch setup.
Concretely:
Model: gpt-oss:20b running under Ollama
Endpoint: POST /api/generate with stream: false, max_tokens: 300
Usage pattern:
Python script looping over rows of an Excel file
For each row, it:
Builds a short prompt (1–2 sentences + instructions)
Calls the Ollama HTTP endpoint
Parses the response and writes the result back to another Excel file
Hardware / runtime: single Ollama container, GPU utilization around 89% during generation
Observed performance: about 50 seconds per answer from request to full response for this 20B model
For my use case (offline generation over a lot of rows), that latency per call is what I’m calling catastrophic: it makes the batch effectively unusable at scale, even though the GPU appears to be well loaded.
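For reference, each iteration of the loop boils down to one blocking HTTP call, roughly like this sketch (prompt shortened; I'm assuming the default Ollama port 11434, and in the native /api/generate API the token limit is passed as options.num_predict):

curl -s http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "<1-2 sentences from the Excel row + instructions>",
  "stream": false,
  "options": { "num_predict": 300 }
}'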
So my questions are:
Is this roughly what I should expect from gpt-oss:20b under Ollama?
Would running the same model via llama.cpp directly usually give noticeably better latency/throughput on the same hardware?
Or is there something obvious in this pattern (single-request loop, stream=false, etc.) that I should change to get reasonable performance?
I’m trying to run llama.cpp in Docker, but I’m confused about which image I should use for my setup.
I’m on an NVIDIA Spark instance, but the host architecture is ARM64:
Host platform: linux/arm64/v8
When I try to run the CUDA-enabled image:
ghcr.io/ggml-org/llama.cpp:server-cuda
Docker shows this warning:
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8)
The container still starts, but obviously CUDA is not usable, since amd64 → arm64 emulation cannot provide GPU acceleration. --gpus all is basically ignored inside the container.
🔍 What I’m trying to understand:
✔ Is there a CUDA-enabled llama.cpp Docker image that supports ARM64 + NVIDIA GPU?
✔ Or is this combination simply not supported at the moment?
✔ If GPU acceleration is possible on this hardware, which image should I use?
✔ If not, does that mean I must run the CPU-only image instead:
ghcr.io/ggml-org/llama.cpp:server
📌 System information:
(I can provide more if needed)
uname -m
→ arm64
docker run --gpus all ...
→ GPU is detected by Docker, but not usable inside the container
nvidia-smi
→ works on the host
🎯 Goal:
I want to run llama.cpp with GPU acceleration on my NVIDIA Spark machine — or at least confirm whether it is impossible with current images and architecture.
Any guidance (alternative images, custom builds, workarounds) would be greatly appreciated. Thanks!
There seems to be an open issue with the arm64 images.
So there are currently no arm64 images. When testing llama.cpp on Spark, I'm building it myself and running it without a container.
It is a build-from-source tutorial, but it should hopefully not be too bad. Check the extra note I added further down about adding curl support as well, which will allow it to grab models straight from Hugging Face.
Well, quick and dirty as I don’t have much time right now:
Install the essential build packages needed for building llama.cpp.
Create a working directory where you want to place the repository. I use src in my home dir for example.
Go into your src directory and clone the llama.cpp repo. Go into the newly created llama.cpp directory (repo).
Then build the binaries with the cmake commands mentioned in the second link. After a successful compilation, the llama-server binary will be located in ~/src/llama.cpp/build-cuda/bin.
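Roughly, the whole sequence looks like this (a sketch from memory - package names and cmake flags may differ slightly from the linked guides; -DLLAMA_CURL=ON covers the curl support mentioned earlier):

sudo apt install -y build-essential cmake git libcurl4-openssl-dev
mkdir -p ~/src && cd ~/src
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build-cuda -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build-cuda --config Release -j
ls build-cuda/bin/llama-server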
You don't need to manually download GGUFs; just use the -hf parameter and specify the Hugging Face model name, and it will download the model automatically into ~/.cache/llama.cpp and run it. Another advantage of this approach is that it will automatically pick up the multimodal (MM) projector for vision models.
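For example (the model repo here is just an illustration - substitute whatever GGUF repo you want, and pick whatever port you configured in Open WebUI):

~/src/llama.cpp/build-cuda/bin/llama-server -hf ggml-org/gpt-oss-20b-GGUF --port 8000

The first run downloads the GGUF into ~/.cache/llama.cpp; later runs reuse the cached file.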
I'd recommend going with llama.cpp, but even for Ollama, and even if it's not using any GPU acceleration, your numbers are way off - it looks like it is reloading the model every time you send a request. Once the model is loaded, response times with this model will be in the subsecond range for short prompts, and generation will be pretty fast too. I don't use this model; I use gpt-oss-120b instead, and even that gets 60 t/s inference on Spark in llama.cpp.
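A quick way to verify (assuming the bundled container from the first post; adjust the host/port to however your script reaches Ollama): run ollama ps while the batch job is going - the model should stay listed between requests. If it keeps unloading, you can pin it in memory with the keep_alive option, which accepts -1 for "keep loaded indefinitely":

docker exec -it open-webui ollama ps
curl http://localhost:11434/api/generate -d '{ "model": "gpt-oss:20b", "keep_alive": -1 }'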