Compiling llama.cpp

I am having the time of my life here... I can't for the life of me get llama.cpp to compile, following Tutorial: Build llama.cpp from source and run Qwen3 235B.

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
After "cmake --build build --config Release -j 20" I am getting:

[ 28%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q6_k.cu.o
ptxas /tmp/tmpxft_00009a8f_00000000-7_mmq-instance-mxfp4.ptx, line 1008; error : Instruction 'mma with block scale' not supported on .target 'sm_121'

…truncated…

ptxas /tmp/tmpxft_00009a8f_00000000-7_mmq-instance-mxfp4.ptx, line 123447; error : Feature '.block_scale' not supported on .target 'sm_121'
ptxas /tmp/tmpxft_00009a8f_00000000-7_mmq-instance-mxfp4.ptx, line 123447; error : Feature '.scale_vec::2X' not supported on .target 'sm_121'
ptxas fatal : Ptx assembly aborted due to errors
gmake[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:1502: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-mxfp4.cu.o] Error 255
gmake[2]: *** Waiting for unfinished jobs…
gmake[1]: *** [CMakeFiles/Makefile2:1881: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2

This is on a fresh factory recovery from today and also on a brand-new unit, so two units with the same error, and it worked a few days ago, so I'm completely at a loss here...

Any ideas?

It’s not you, they recently added support for native MXFP4 to llama.cpp for Blackwell GPUs. This was a problem a couple of days ago, then it was fixed, but I just had a look now, and it seems like the problem has been reintroduced.

As a workaround, you can disable native MXFP4 support by passing -DGGML_NATIVE=OFF, which should get you building. Native MXFP4 is mostly useful for prompt processing anyway, and doesn't seem to do all that much for token generation on gpt-oss-120b.

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=OFF
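
Then rebuild as usual:

cmake --build build --config Release -j 20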

As for the issue being reintroduced, this is pretty disappointing.

1 Like

I raised an issue for it: Compile bug: CUDA build for mmq breaks for compute capability 121 · Issue #18425 · ggml-org/llama.cpp · GitHub

For more information on the original issue, refer to: Compile bug: CUDA build for mmq breaks for compute capability 120 · Issue #18363 · ggml-org/llama.cpp · GitHub

On DGX Spark we're technically compute capability 121, but this bug affected more than what was raised in the original issue, so it didn't just hit 120. That makes me think I shouldn't have explicitly called it a 121 issue, as it likely affects other architectures too, but I've already raised it, so I'm just going to leave it as it is.
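
If you want to double-check what your own GPU reports, something like this should work (assuming your nvidia-smi is recent enough to support the compute_cap query field):

nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# e.g. "NVIDIA GB10, 12.1" on a DGX Spark, which is what CMake calls architecture 121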

3 Likes

Thanks Raziel, I got another error:

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=OFF
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- ARM detected
-- Checking for ARM features using flags:
-- Adding CPU backend variant ggml-cpu:
-- CUDA Toolkit found
CMake Error at ggml/src/ggml-cuda/CMakeLists.txt:64 (message):
Compute capability 120 used, use 120a or 120f for Blackwell specific
optimizations

So I ended up using:

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=OFF -DCMAKE_CUDA_ARCHITECTURES=120f
and then:
cmake --build build --config Release -j 20

You shouldn't need to specify either of those. I just tested the latest code, and it builds without a problem at the moment with these commands:

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j 20

With that said, they seem to be messing around in that area at the moment, so it wouldn’t surprise me if it was a problem that has since been resolved.

Also, make sure you always delete the build directory. It's probably not strictly required, but I do it whenever I compile llama.cpp, and it only takes a couple of minutes to build since we've got those 20 CPU cores on the DGX Spark.

The way I do it: I have a llama.cpp directory, and inside it the code is checked out into a 'code' directory. I then have a bin directory and a models directory at the same level, plus an update script that looks like this:

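# Update script: wipe the previous build, pull the latest llama.cpp,
# rebuild, and copy the fresh binaries into ../bin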
cd code &&
rm -rf build &&
git pull &&
#cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=OFF &&
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real &&
#cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON &&
cmake --build build --config Release -j 20 &&
echo Copying binary files... &&
cd .. &&
rm -f bin/* &&
cp code/build/bin/l* bin &&
echo Done

As you can see, I have various variants of the cmake command; I don't know if they all still make sense, but if you're not going to distribute the binaries, you might be best off using 121a-real, which builds highly optimised but architecture-specific binaries (so they won't run on anything else). Otherwise the default, where you don't specify anything, should work.
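
To put the two options side by side (both commands are from earlier in this thread):

# Architecture-specific: fastest, but the binaries only run on compute capability 12.1 GPUs like the GB10
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real

# Default: don't specify an architecture and let llama.cpp pick one
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON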

I build with these options (adding the RPC backend so it can work on dual Sparks):

cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j

Blackwell optimizations increased prompt processing speed for gpt-oss-120b: it went from ~1900 t/s to ~2400 t/s.

2 Likes

@eugr
Funny, I did the same RPC setup today.

Anything special on the llama-server switch side, like ngl, tensor_split, cn, np, etc.?

And wow, 2400 t/s!!

Which specific model is that, and how are you testing to get those t/s?

gpt-oss-120b, no RPC, just a single node. This is prefill speed; generation is still 58 t/s:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 2438.11 ± 13.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 57.81 ± 0.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 2294.32 ± 12.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 54.68 ± 0.52 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 2149.21 ± 8.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 51.75 ± 0.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1824.37 ± 8.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 48.29 ± 0.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1415.53 ± 9.85 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 41.42 ± 0.17 |

As for RPC, nothing special: just make sure you use -c on the receiver side for caching, and use the IP that belongs to the ConnectX port. Don't expect any speed increase with it; the latency of TCP/IP kills performance, and llama.cpp can't do RDMA. In any case, even if it supported RDMA, you wouldn't get any performance gains compared to a single node (maybe on PP only), as llama.cpp can't do tensor split.
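
Roughly, the setup looks like this (the IP and port are just placeholders for your ConnectX addresses, so double-check the exact flags against rpc-server --help and llama-server --help on your build):

# On the receiving Spark: start the RPC server, bound to the ConnectX IP, with local caching enabled
build/bin/rpc-server --host 10.0.0.2 --port 50052 -c

# On the main Spark: point llama-server at that backend over the same link
build/bin/llama-server -m model.gguf -ngl 999 --rpc 10.0.0.2:50052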

So if you want to take full advantage of a cluster setup, it's better to run models in vLLM with tensor split.
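
For example, something along these lines with vLLM (just a sketch: the model name is illustrative, and tensor parallelism across two Sparks also needs a multi-node setup such as a Ray cluster):

# Shard the model across two GPUs with tensor parallelism
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2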

2 Likes

My result, on an ASUS Ascent GX10.

dgx@gx10-4323:~$ ~/Downloads/llama.cpp/build-gpu/bin/llama-bench -m "$HOME/models/gpt-oss-120b-gguf/gpt-oss-120b-mxfp4-00001-of-00003.gguf" -fa 1 -d 0,4096,8192,16384,32768 -ngl 999 -p 2048 -n 32 -ub 2048 -mmp 0

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | pp2048 | 2478.28 ± 10.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | tg32 | 59.16 ± 0.27 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | pp2048 @ d4096 | 2323.76 ± 7.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | tg32 @ d4096 | 54.92 ± 0.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | pp2048 @ d8192 | 2191.98 ± 6.83 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | tg32 @ d8192 | 52.08 ± 0.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | pp2048 @ d16384 | 1884.22 ± 8.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | tg32 @ d16384 | 48.47 ± 0.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | pp2048 @ d32768 | 1442.59 ± 8.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | tg32 @ d32768 | 41.40 ± 0.23 |

1 Like