cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON
After running "cmake --build build --config Release -j 20" I am getting:
[ 28%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-q6_k.cu.o
ptxas /tmp/tmpxft_00009a8f_00000000-7_mmq-instance-mxfp4.ptx, line 1008; error : Instruction 'mma with block scale' not supported on .target 'sm_121'
..truncated….
ptxas /tmp/tmpxft_00009a8f_00000000-7_mmq-instance-mxfp4.ptx, line 123447; error : Feature '.block_scale' not supported on .target 'sm_121'
ptxas /tmp/tmpxft_00009a8f_00000000-7_mmq-instance-mxfp4.ptx, line 123447; error : Feature '.scale_vec::2X' not supported on .target 'sm_121'
ptxas fatal : Ptx assembly aborted due to errors
gmake[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:1502: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/template-instances/mmq-instance-mxfp4.cu.o] Error 255
gmake[2]: *** Waiting for unfinished jobs....
gmake[1]: *** [CMakeFiles/Makefile2:1881: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/all] Error 2
gmake: *** [Makefile:146: all] Error 2
This is on a fresh factory recovery from today and also on a brand-new unit, so two units with the same error, and it worked a few days ago, so I'm completely at a loss here.
It's not you: they recently added support for native MXFP4 to llama.cpp for Blackwell GPUs. This was a problem a couple of days ago, then it was fixed, but I just had a look now and it seems the problem has been reintroduced.
As a workaround, you can disable native MXFP4 support by passing -DGGML_NATIVE=OFF, which should get you building. Native MXFP4 is mostly useful for prompt processing anyway, and doesn't seem to do all that much for token generation on GPT-OSS:120b.
On DGX Spark we're technically compute capability 121, but this bug affected more than what was raised in the original issue, so it didn't just affect 120. That makes me think I shouldn't have explicitly called it a 121 issue, as it likely affects others too, but I've already raised it, so I'm just going to leave it as it is.
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=OFF
CMAKE_BUILD_TYPE=Release
-- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
-- CMAKE_SYSTEM_PROCESSOR: aarch64
-- GGML_SYSTEM_ARCH: ARM
-- Including CPU backend
-- ARM detected
-- Checking for ARM features using flags:
-- Adding CPU backend variant ggml-cpu:
-- CUDA Toolkit found
CMake Error at ggml/src/ggml-cuda/CMakeLists.txt:64 (message):
Compute capability 120 used, use 120a or 120f for Blackwell specific
optimizations
With that said, they seem to be messing around in that area at the moment, so it wouldn’t surprise me if it was a problem that has since been resolved.
Also, make sure you always delete the build directory. It's probably not strictly required, but I do it whenever I compile llama.cpp, and it only takes a couple of minutes to build since we've got those 20 CPU cores on the DGX Spark.
The way I do it is that I have a llama.cpp directory, and inside it the code is checked out into a 'code' directory. I then have a 'bin' directory at the same level and a 'models' directory, and then I have an update script that looks like this:
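Roughly along these lines - this is a simplified sketch rather than my exact script, so adapt the paths (the 'code' and 'bin' names match the layout above) and the commented-out cmake variant to taste:

#!/usr/bin/env bash
# Sketch of the update script: pull the latest llama.cpp, do a clean rebuild, copy the binaries out.
set -e

cd "$(dirname "$0")/code"
git pull

# Always start from a clean build directory
rm -rf build

# Pick one of the cmake variants (the second one is the architecture-specific build mentioned below):
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=OFF
# cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_ARCHITECTURES="121a-real"

cmake --build build --config Release -j 20

# Copy the freshly built binaries into the sibling 'bin' directory
cp build/bin/* ../bin/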
As you can see, I have various variants of the cmake command. I don't know if this still makes sense, but if you're not going to distribute the binaries, you might be best off using 121a-real, which builds highly optimised but architecture-specific binaries (so they won't run on anything else). Otherwise the default, where you don't specify anything, should work.
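If I remember right, that's passed through the standard CMAKE_CUDA_ARCHITECTURES CMake variable, something like the lines below (whether 121a is accepted depends on your CUDA toolkit and CMake versions):

cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_ARCHITECTURES="121a-real"
cmake --build build --config Release -j 20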
As for RPC, nothing special: just make sure you use -c on the receiver side for caching and use the IP that belongs to the ConnectX port. Don't expect any speed increase with it; the latency of TCP/IP kills performance, and llama.cpp can't do RDMA. In any case, even if it supported RDMA, you wouldn't get any performance gains compared to a single node - maybe on PP only, as llama.cpp can't do tensor split.
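For reference, a minimal RPC setup looks something like this - the 192.168.100.x addresses and the model path are just placeholders for your ConnectX link and whatever GGUF you're running, and flag names may shift between llama.cpp versions:

# On the second Spark (receiver): bind to the ConnectX IP and enable the cache with -c
./bin/rpc-server -H 192.168.100.2 -p 50052 -c

# On the first Spark (sender): point --rpc at that address
./bin/llama-cli -m models/gpt-oss-120b.gguf --rpc 192.168.100.2:50052 -ngl 99 -p "hello"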
So if you want to take full advantage of a cluster setup, it's better to run models in vLLM with tensor parallelism.
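As a rough illustration, in vLLM that's just the tensor-parallel flag - this assumes you already have the multi-node plumbing (e.g. a Ray cluster spanning both Sparks) in place, and the model name is only an example:

vllm serve openai/gpt-oss-120b --tensor-parallel-size 2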