LLM inference results?

Hey,

I am considering acquiring a second-hand AGX Xavier and using it for LLM inference. Does anyone have any benchmarks? The only thing I have found so far is in this forum (Llama 3.1 8B, about 8.4 tok/s token generation and 13.4 tok/s prompt processing). Any other benchmarks or experiences would be appreciated!
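For anyone comparing, I believe figures like these usually come from llama.cpp's llama-bench tool; a typical run (the model filename here is just a placeholder) would be:

~/llama.cpp/build/bin/llama-bench -m llama-3.1-8b-instruct-Q4_K_M.gguf -ngl 99 -p 512 -n 128

The pp512 and tg128 rows in its output correspond to prompt-processing and token-generation throughput in tokens/sec.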

Hi,

We don’t have LLM benchmark data for Xavier.
Maybe other users can share their experience.

But we do have several benchmark scores for Orin and Thor for your reference.

Thanks.

I ran llama.cpp + gpt-oss-20b-Q4_K_M.gguf on the NVIDIA Jetson AGX Xavier, and the test results are as follows:
command:
~/llama.cpp/build/bin/llama-server -m "$selected_gguf" --host 0.0.0.0 --port 1234 -c 12288 -b 256 -ub 128 --flash-attn 0 --no-warmup --jinja -a "$(basename "$selected_gguf" .gguf)"
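Once the server is up, it exposes an OpenAI-compatible HTTP API, so a quick smoke test against the port above looks like this (a sketch; the prompt and token limit are arbitrary):

curl -s http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":32}'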

Result: context used 1770/12288 (14%), output 1681 tokens (no limit), 12.9 tokens/sec.
Conclusion: 12288 is about as large as the context can go; setting it any higher crashes the program.
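If anyone wants to pin down that ceiling more precisely, a rough sweep with llama-cli can show which context sizes still load (a sketch; it reuses $selected_gguf from above, and the flag names assume a recent llama.cpp build):

# hypothetical sweep: report which -c values load without crashing
for ctx in 8192 12288 16384 20480; do
  ~/llama.cpp/build/bin/llama-cli -m "$selected_gguf" -c "$ctx" -n 1 -p "hi" -no-cnv \
    >/dev/null 2>&1 && echo "ctx=$ctx OK" || echo "ctx=$ctx failed"
done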

