I am considering acquiring a second-hand Jetson AGX Xavier and using it for LLM inference. Does anyone have any benchmarks? The only thing I have found so far was in this forum (Llama 3.1 8B, about 8.4 t/s token generation and 13.4 t/s prompt processing). Any other benchmarks or experiences would be appreciated!
I ran llama.cpp with gpt-oss-20b-Q4_K_M.gguf on the NVIDIA Jetson AGX Xavier, and the test results are as follows:
Command: ~/llama.cpp/build/bin/llama-server -m "$selected_gguf" --host 0.0.0.0 --port 1234 -c 12288 -b 256 -ub 128 --flash-attn 0 --no-warmup --jinja -a "$(basename "$selected_gguf" .gguf)"
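(In case anyone copies this: $selected_gguf is just a shell variable holding the model path, set by a wrapper script that isn't shown. A minimal, hypothetical way to set it; the picker below is my assumption, only the llama-server invocation above is from the actual test:)

```bash
# Hypothetical helper: let the user pick a .gguf from ~/models.
# The directory and picker logic are assumptions, not part of the original setup.
select selected_gguf in "$HOME"/models/*.gguf; do
  [ -n "$selected_gguf" ] && break
done
echo "Selected: $selected_gguf"
```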
Result: 1770/12288 context tokens used (14%), 1681 output tokens, 12.9 tokens/sec generation.
Conclusion: 12288 is about as large as the context can go on this board; setting it any higher crashes the server, presumably because the KV cache exhausts the Xavier's shared CPU/GPU memory.
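If anyone wants numbers directly comparable to the pp/tg figures quoted in the question, llama.cpp ships a dedicated benchmark tool, llama-bench, which reports prompt processing and token generation throughput separately. A minimal sketch (the model path is an assumption, point it at your own .gguf):

```bash
# Reports pp (prompt processing) and tg (token generation) throughput.
# Model path below is assumed; adjust -m to your actual .gguf file.
~/llama.cpp/build/bin/llama-bench \
  -m ~/models/gpt-oss-20b-Q4_K_M.gguf \
  -p 512 -n 128
```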