
Conversation

@ikawrakow (Owner)

TG is about the same. PP-512 comparison between main and this PR for LLaMA-3.1-8B on a Ryzen-5975WX (AVX2) and a Ryzen-7950X (Zen4):

| model | backend | threads | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q4_K_S | AVX2 | 32 | pp512 | 291.90 ± 0.64 | 327.98 ± 0.51 | 1.124 |
| llama 8B Q5_K_S | AVX2 | 32 | pp512 | 273.59 ± 0.37 | 302.13 ± 0.61 | 1.104 |
| llama 8B Q4_K_S | Zen4 | 16 | pp512 | 258.78 ± 1.05 | 267.69 ± 0.31 | 1.034 |
| llama 8B Q5_K_S | Zen4 | 16 | pp512 | 246.19 ± 0.65 | 249.12 ± 0.42 | 1.012 |

The improvement on Zen4 is very minor. The benefit there is code-bloat reduction, as the Zen4 path now reuses the same implementation as AVX2.
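The Speedup column above is simply the ratio of the PR and main throughputs. A minimal sketch that reproduces it from the table's mean t/s values (the ± errors are ignored here):

```python
# Speedup = t/s (PR) / t/s (main), computed from the pp512 means above.
# The dict below just restates the table; it is not part of the PR itself.
results = {
    "Q4_K_S / AVX2": (291.90, 327.98),
    "Q5_K_S / AVX2": (273.59, 302.13),
    "Q4_K_S / Zen4": (258.78, 267.69),
    "Q5_K_S / Zen4": (246.19, 249.12),
}

for name, (main_tps, pr_tps) in results.items():
    print(f"{name}: speedup = {pr_tps / main_tps:.3f}")
```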

Iwan Kawrakow added 7 commits January 29, 2025 15:32
- We now arrive at PP-512 = 328 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU, up from 291 t/s when I last measured on 3c5f872. With FA and Q8_0 K-cache we get to 339.5 t/s.
- We arrive at 302 t/s for LLaMA-3.1-8B on a Ryzen-5975WX CPU, up from 273 t/s.
- After the changes I made to AVX2, it ends up being slightly faster compared to what I had for Zen4.
@ikawrakow ikawrakow merged commit 2e6b523 into main Jan 30, 2025