DGX Spark is extremely slow on a short LLM test

Hi Community,
I really need some help with this topic, either to find out whether my hardware is faulty or to identify the bottleneck in my system.

After the initial setup and finishing all NVIDIA updates, I installed the open-webui:ollama container as explained. Then I ran this command to get a first benchmark result:

docker exec -it open-webui ollama run deepseek-coder:33b-instruct "Write a python function that calculates the factorial of a number." --verbose

Here are the statistics:
total duration: 24.560680687s
load duration: 10.807158232s
prompt eval count: 83 token(s)
prompt eval duration: 153.438776ms
prompt eval rate: 540.93 tokens/s
eval count: 165 token(s)
eval duration: 13.576963217s
eval rate: 12.15 tokens/s

Why is that system so slow???

Any help is more than welcome…

I assume Ollama is not optimized for DGX Spark yet.

I would recommend replacing it with a current llama.cpp build/container, which is very fast and has already been optimized for Spark with support from NVIDIA.

See the discussion on GitHub:

I wouldn’t use deepseek-coder anymore. Try Qwen/Qwen3-Coder-30B-A3B-Instruct or openai/gpt-oss-20b or -120b instead. Those are MoE models which are very fast even on a Spark.

  1. Don’t use Ollama. llama.cpp is faster and gives you more flexibility in loading and configuring models.
  2. Having said that, 12-13 t/s is about what I’d expect from a 33B dense model on Spark.
  3. Deepseek-coder is very old; there are better and faster models available. For Spark you want to use MoE models, as they have very few active parameters, so they can be fast.

Spark has a memory bandwidth of 273 GB/s. That is about 2x slower than a desktop RTX 5070, about 3x slower than a desktop 4090, and about 6x slower than a desktop 5090. Token generation (inference) performance depends mostly on memory bandwidth; prefill (prompt processing) depends more on GPU compute. DGX Spark has GPU compute comparable to a desktop 5070 (or slightly less).
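
As a rough back-of-the-envelope check (my own estimate, not a measured number): a 33B dense model at ~4-5 bit quantization is roughly 18-20 GB of weights, and every generated token has to read essentially all of them, so the upper bound is roughly:

273 GB/s / 20 GB per token ≈ 13-14 tokens/s

which lines up with the ~12 t/s you measured.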

Sounds great, but how do I use that and integrate it so I can access it, e.g. via Open WebUI?

llama.cpp comes with a built-in chat interface that is quite capable.

As for Open WebUI, you can add a new connection for llama.cpp, as it is OpenAI API compatible.

Depending on the port you choose for llama.cpp (llama-server defaults to 8080; the example below uses 8000), you just need to add it to the URL. When Open WebUI is also running inside a container, you will need to replace localhost with the IP address of the Spark's primary network interface.

So http://192.168.0.123:8000/v1 (example) would be a URL you would enter for your connection.
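
To find that IP address, something like this should work on the Spark (a sketch, assuming a standard DGX OS / Ubuntu setup):

hostname -I | awk '{print $1}'   # prints the first configured IP, e.g. 192.168.0.123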

See llama.cpp/docs/docker.md at master · ggml-org/llama.cpp · GitHub for using docker images. For example:

docker run --gpus all -p 8000:8000 -v $HOME/models:/models ghcr.io/ggml-org/llama.cpp:full-cuda -s -hf ggml-org/gpt-oss-120b-GGUF --port 8000 --host 0.0.0.0 -c 0 --jinja

-hf for downloading a model directly from Hugging Face – in this case ggml-org/gpt-oss-120b-GGUF · Hugging Face
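
Once the container is up, you can sanity-check the OpenAI-compatible endpoint before adding the connection in Open WebUI. A minimal sketch (the model name in the request body is just a placeholder; llama-server simply serves whatever model it has loaded):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-oss-120b", "messages": [{"role": "user", "content": "Say hello"}]}'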

There are also helpers to ease the use of llama.cpp, like llama-swap. Not sure if there are ready-to-use arm64 images for that already.

Still waiting for ASUS Germany to deliver… so I can’t test it myself yet.

There are, I’m using llama-swap on my Spark.

This DGX Spark playbook, “Open WebUI with Ollama | DGX Spark”, deploys Open WebUI with Ollama and also integrates with NVIDIA Sync for remote connectivity.

That’s the one OP was referring to.


So if the documentation is released, Ollama should work correctly? Performance is catastrophic…

I have no idea if Ollama performance is better now; it’s almost always better to just use llama.cpp directly.

Also, define “catastrophic”. What model are you using, what performance are you getting (and how are you measuring it), and what are your expectations?

Could you please specify which Docker image of llama-swap you successfully used? I see several versions.

By “catastrophic” I mean: ~50 seconds of wall-clock time for each relatively short completion, in a simple batch setup.

Concretely:

  • Model: gpt-oss:20b running under Ollama

  • Endpoint: POST /api/generate with stream: false, max_tokens: 300

  • Usage pattern:

    • Python script looping over rows of an Excel file

    • For each row, it:

      • Builds a short prompt (1–2 sentences + instructions)

      • Calls the Ollama HTTP endpoint

      • Parses the response and writes the result back to another Excel file

  • Hardware / runtime: single Ollama container, GPU utilization around 89% during generation

  • Observed performance: about 50 seconds per answer from request to full response for this 20B model

For my use case (offline generation over a lot of rows), that latency per call is what I’m calling catastrophic: it makes the batch effectively unusable at scale, even though the GPU appears to be well loaded.

So my questions are:

  • Is this roughly what I should expect from gpt-oss:20b under Ollama?

  • Would running the same model via llama.cpp directly usually give noticeably better latency/throughput on the same hardware?

  • Or is there something obvious in this pattern (single-request loop, stream=false, etc.) that I should change to get reasonable performance?

Hi everyone,

I’m trying to run llama.cpp in Docker, but I’m confused about which image I should use for my setup.
I’m on an NVIDIA Spark instance, but the host architecture is ARM64:

Host platform: linux/arm64/v8

When I try to run the CUDA-enabled image:

ghcr.io/ggml-org/llama.cpp:server-cuda

Docker shows this warning:

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8)

The container still starts, but obviously CUDA is not usable, since amd64 → arm64 emulation cannot provide GPU acceleration.
--gpus all is basically ignored inside the container.


🔍 What I’m trying to understand:

Is there a CUDA-enabled llama.cpp Docker image that supports ARM64 + NVIDIA GPU?
Or is this combination simply not supported at the moment?
If GPU acceleration is possible on this hardware, which image should I use?
If not, does that mean I must run the CPU-only image instead:

ghcr.io/ggml-org/llama.cpp:server


📌 System information:

(I can provide more if needed)

uname -m
→ arm64

docker run --gpus all ...
→ GPU is detected by Docker, but not usable inside the container

nvidia-smi
→ works on the host


🎯 Goal:

I want to run llama.cpp with GPU acceleration on my NVIDIA Spark machine — or at least confirm whether it is impossible with current images and architecture.

Any guidance (alternative images, custom builds, workarounds) would be greatly appreciated. Thanks!

There seems to be an open issue with the arm64 images.

So there are no arm64 images at the moment. When testing llama.cpp on Spark, I’m currently building it myself and running it without a container.

See also Performance of llama.cpp on NVIDIA DGX Spark · ggml-org/llama.cpp · Discussion #16578 · GitHub for instructions on how to build.

Bleeding-edge technology… at the time of writing I didn’t have my GB10 yet and couldn’t test it myself.


Do you have a guide to deploy that without Docker?

See if my instructions here help you out:

It is a tutorial to build from source, but it should hopefully not be too bad. Check the extra note I added further down to add curl support as well, which will allow it to grab models straight from Hugging Face.

Well, quick and dirty as I don’t have much time right now:

  1. Install the essential build packages needed for building llama.cpp.

  2. Create a working directory where you want to place the repository. I use src in my home dir for example.

  3. Go into your src directory and clone the llama.cpp repo. Go into the newly created llama.cpp directory (repo).

  4. Then build the binaries with the cmake commands mentioned in the second link. After successful compilation the llama-server binary is located in ~/src/llama.cpp/build-cuda/bin.

sudo apt-get install -y build-essential cmake python3 python3-pip git libcurl4-openssl-dev libgomp1

mkdir src
cd src
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda -j

Download the desired GGUFs, like ggml-org/gpt-oss-120b-GGUF · Hugging Face, and run:

~/src/llama.cpp/build-cuda/bin/llama-server -m gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -ngl 999 -c 0 -a ggml-org/gpt-oss-120b --jinja --no-mmap

You might need to adjust the path for the GGUFs. I place them in a models directory in my home (argument -m). For more info on the build process, see llama.cpp/docs/build.md at master · ggml-org/llama.cpp · GitHub or the discussion I mentioned above.
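
If you want a quick tokens/s number after the build (comparable to the figures discussed earlier in the thread), llama-bench from the same build directory works too. A minimal sketch, assuming the GGUF sits in ~/models (adjust the path):

~/src/llama.cpp/build-cuda/bin/llama-bench -m ~/models/gpt-oss-120b-mxfp4-00001-of-00003.gguf -ngl 999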

I hope that helps you to get started.

EDIT: Or you follow RazielAU’s instructions. Didn’t see that he had answered already. :-D


You don’t need to manually download GGUFs; just use the -hf parameter and specify the Hugging Face model name. It will download the model automatically into ~/.cache/llama.cpp and run it. Another advantage of this approach is that it will automatically pick up the MM projector for vision models.
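
A minimal sketch of that, reusing the build path from the previous post (adjust path, port and flags to your setup):

~/src/llama.cpp/build-cuda/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF --jinja -c 0 --host 0.0.0.0 --port 8000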


I’d recommend going with llama.cpp, but even for Ollama, even if it’s not using any GPU acceleration, your numbers are way off; it looks like it is reloading the model every time you send a request. Once the model is loaded, with this model, response times will be in the sub-second range for short prompts, and generation will be pretty fast too. I don’t use this model; I use gpt-oss-120b instead, and even that does 60 t/s inference on Spark in llama.cpp.
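
One quick way to check and work around that (a sketch; the prompt, duration and options are placeholders, and this assumes Ollama's default port 11434): pass keep_alive in the request so the model stays resident between your batch calls, then run ollama ps between two requests to confirm it is still loaded.

curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Summarize this row: ...",
  "stream": false,
  "keep_alive": "30m",
  "options": { "num_predict": 300 }
}'

docker exec -it open-webui ollama ps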