Compiling llama.cpp

I am having the time of my life here... I can't for the life of me get llama.cpp to compile, following Tutorial: Build llama.cpp from source and run Qwen3 235B.

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
After "cmake --build build --config Release -j 20" I am getting:

[ 28%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q6_k.cu.o
ptxas /tmp/tmpxft_00009a8f_00000000-7_mmq-instance-mxfp4.ptx, line 1008; error : Instruction 'mma with block scale' not supported on .target 'sm_121'

…truncated…

ptxas /tmp/tmpxft_00009a8f_00000000-7_mmq-instance-mxfp4.ptx, line 123447; error : Feature '.block_scale' not supported on .target 'sm_121'
ptxas /tmp/tmpxft_00009a8f_00000000-7_mmq-instance-mxfp4.ptx, line 123447; error : Feature '.scale_vec::2X' not supported on .target 'sm_121'
ptxas fatal : Ptx assembly aborted due to errors
gmake[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:1502: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-mxfp4.cu.o] Error 255
gmake[2]: *** Waiting for unfinished jobs…
gmake[1]: *** [CMakeFiles/Makefile2:1881: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2

This is on a fresh factory recovery from today and also on a brand-new unit, so two units with the same error, and it worked a few days ago, so I'm completely at a loss here...

Any ideas?

It’s not you, they recently added support for native MXFP4 to llama.cpp for Blackwell GPUs. This was a problem a couple of days ago, then it was fixed, but I just had a look now, and it seems like the problem has been reintroduced.

As a workaround, you can disable native MXFP4 support by passing -DGGML_NATIVE=OFF, which should get you building. Native MXFP4 is mostly useful for prompt processing anyway, and doesn't seem to do all that much for token generation on gpt-oss-120b.

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=OFF
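
Then rebuild as usual:

cmake --build build --config Release -j 20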

As for the issue being reintroduced, this is pretty disappointing.

1 Like

I raised an issue for it: Compile bug: CUDA build for mmq breaks for compute capability 121 · Issue #18425 · ggml-org/llama.cpp · GitHub

For more information on the original issue, refer to: Compile bug: CUDA build for mmq breaks for compute capability 120 · Issue #18363 · ggml-org/llama.cpp · GitHub

On DGX Spark we're technically compute capability 121, but this bug affected more than what was raised in the original issue, so it didn't just hit 120. That makes me think I shouldn't have explicitly called it a 121 issue, as it likely affects other architectures too, but I've already raised it, so I'm just going to leave it as it is.
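
If you want to double-check what your own GPU reports, something like this should work (assuming your nvidia-smi is recent enough to support the compute_cap query field):

nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
# e.g. "NVIDIA GB10, 12.1" on a DGX Spark, which is what CMake calls architecture 121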

3 Likes

Thanks Raziel, I got another error:

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=OFF
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- ARM detected
-- Checking for ARM features using flags:
-- Adding CPU backend variant ggml-cpu:
-- CUDA Toolkit found
CMake Error at ggml/src/ggml-cuda/CMakeLists.txt:64 (message):
Compute capability 120 used, use 120a or 120f for Blackwell specific
optimizations

So I ended up using:

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=OFF -DCMAKE_CUDA_ARCHITECTURES=120f
and then:
cmake --build build --config Release -j 20

You shouldn't need to specify either of those. I just tested the latest code, and it builds without a problem at the moment with these commands:

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j 20

With that said, they seem to be messing around in that area at the moment, so it wouldn’t surprise me if it was a problem that has since been resolved.

Also, make sure you always delete the build directory. It's probably not strictly required, but I do it whenever I compile llama.cpp, and it only takes a couple of minutes to build since we've got those 20 CPU cores on the DGX Spark.

The way I do it: I have a llama.cpp directory, and inside it the code is checked out into a 'code' directory. I then have a bin directory and a models directory at the same level, plus an update script that looks like this:

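# Update script: wipe the previous build, pull the latest llama.cpp,
# rebuild, and copy the fresh binaries into ../bin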
cd code &&
rm -rf build &&
git pull &&
#cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=OFF &&
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real &&
#cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON &&
cmake --build build --config Release -j 20 &&
echo Copying binary files... &&
cd .. &&
rm -f bin/* &&
cp code/build/bin/l* bin &&
echo Done

As you can see, I have various variants of the cmake command; I don't know if they all still make sense, but if you're not going to distribute the binaries, you might be best off using 121a-real, which builds highly optimised but architecture-specific binaries (so they won't run on anything else). Otherwise the default, where you don't specify anything, should work.
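
To put the two options side by side (both commands are from earlier in this thread):

# Architecture-specific: fastest, but the binaries only run on compute capability 12.1 GPUs like the GB10
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real

# Default: don't specify an architecture and let llama.cpp pick one
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON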

I build with these options (adding the RPC backend so it can work on dual Sparks):

cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_RPC=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real
cmake --build build --config Release -j

Blackwell optimizations increased prompt processing speed for gpt-oss-120b: it went from ~1900 t/s to ~2400 t/s.

2 Likes

@eugr
Funny, I did the same RPC setup today.

Anything special on the llama-server switch side, like ngl, tensor_split, cn, np, etc.?

And wow, 2400 t/s!!

Which specific model is that, and how are you testing to get those t/s?

gpt-oss-120b, no RPC, just a single node. This is prefill speed; generation is still 58 t/s:

build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 -d 0,4096,8192,16384,32768 -p 2048 -n 32 -ub 2048 -mmp 0
| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 | 2438.11 ± 13.72 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 | 57.81 ± 0.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d4096 | 2294.32 ± 12.61 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d4096 | 54.68 ± 0.52 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d8192 | 2149.21 ± 8.88 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d8192 | 51.75 ± 0.56 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d16384 | 1824.37 ± 8.93 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d16384 | 48.29 ± 0.21 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | pp2048 @ d32768 | 1415.53 ± 9.85 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | tg32 @ d32768 | 41.42 ± 0.17 |

As for RPC, nothing special: just make sure you use -c on the receiver side for caching, and use the IP that belongs to the ConnectX port. Don't expect any speed increase with it; the latency of TCP/IP kills performance, and llama.cpp can't do RDMA. In any case, even if it supported RDMA, you wouldn't get any performance gains compared to a single node (maybe on PP only), as llama.cpp can't do tensor split.
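
Roughly, the setup looks like this (the IP and port are just placeholders for your ConnectX addresses, so double-check the exact flags against rpc-server --help and llama-server --help on your build):

# On the receiving Spark: start the RPC server, bound to the ConnectX IP, with local caching enabled
build/bin/rpc-server --host 10.0.0.2 --port 50052 -c

# On the main Spark: point llama-server at that backend over the same link
build/bin/llama-server -m model.gguf -ngl 999 --rpc 10.0.0.2:50052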

So if you want to take full advantage of a cluster setup, it's better to run models in vLLM with tensor split.
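
For example, something along these lines with vLLM (just a sketch: the model name is illustrative, and tensor parallelism across two Sparks also needs a multi-node setup such as a Ray cluster):

# Shard the model across two GPUs with tensor parallelism
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2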

2 Likes

My result, on an ASUS Ascent GX10.

dgx@gx10-4323:~$ ~/Downloads/llama.cpp/build-gpu/bin/llama-bench -m "$HOME/models/gpt-oss-120b-gguf/gpt-oss-120b-mxfp4-00001-of-00003.gguf" -fa 1 -d 0,4096,8192,16384,32768 -ngl 999 -p 2048 -n 32 -ub 2048 -mmp 0

Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes

| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | pp2048 | 2478.28 ± 10.53 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | tg32 | 59.16 ± 0.27 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | pp2048 @ d4096 | 2323.76 ± 7.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | tg32 @ d4096 | 54.92 ± 0.25 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | pp2048 @ d8192 | 2191.98 ± 6.83 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | tg32 @ d8192 | 52.08 ± 0.35 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | pp2048 @ d16384 | 1884.22 ± 8.08 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | tg32 @ d16384 | 48.47 ± 0.20 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | pp2048 @ d32768 | 1442.59 ± 8.86 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 1 | 0 | tg32 @ d32768 | 41.40 ± 0.23 |

1 Like