Repack also experts #210

ikawrakow · 2025-02-19T08:00:41Z

When I implemented run-time repacking, I required a tensor to be 2D to be eligible for repacking (I guess to simplify the code). But I forgot about MoE models, where the expert weights are stored in 3D tensors.

This PR fixes that, so expert tensors are now repacked as well. This leads to very significant performance gains. E.g., for DeepSeek-Lite quantized with IQ4_XS and using run-time repacking, we get PP-512 = 545 t/s on the main branch and PP-512 = 677 t/s with this PR, i.e., about 24% faster prompt processing.
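To illustrate the idea (this is not the code from this PR), here is a minimal sketch of why a 3D expert tensor can be repacked the same way as a 2D one: each expert slice is itself a 2D matrix, so the eligibility check only needs to stop rejecting a third dimension, and the repacking routine can loop over the slices. Everything here is a hypothetical stand-in: `ToyTensor`, `kInterleave`, `eligible_2d_only`, `eligible_with_experts`, `repack_slice` and `repack` are made-up names, the data is plain `float` rather than quantized blocks, and interleaving 4 rows per group is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a tensor descriptor; NOT the real ggml_tensor.
struct ToyTensor {
    int64_t ne[3];            // ne[0] = columns, ne[1] = rows, ne[2] = experts (1 for a plain matrix)
    std::vector<float> data;  // row-major, expert slices stored back to back
};

// Assumed number of rows interleaved per group (illustrative only).
constexpr int64_t kInterleave = 4;

// Old rule as described in the text: only 2D tensors qualify.
inline bool eligible_2d_only(const ToyTensor& t) {
    return t.ne[2] == 1 && t.ne[1] % kInterleave == 0;
}

// Relaxed rule: 3D tensors qualify too, because each expert slice is a 2D matrix.
inline bool eligible_with_experts(const ToyTensor& t) {
    return t.ne[2] >= 1 && t.ne[1] % kInterleave == 0;
}

// Interleave groups of kInterleave rows within one 2D slice:
// for each group, emit column 0 of every row in the group, then column 1, and so on.
inline void repack_slice(const float* src, float* dst, int64_t ncols, int64_t nrows) {
    for (int64_t r0 = 0; r0 < nrows; r0 += kInterleave) {
        float* out = dst + r0 * ncols;
        for (int64_t c = 0; c < ncols; ++c) {
            for (int64_t k = 0; k < kInterleave; ++k) {
                *out++ = src[(r0 + k) * ncols + c];
            }
        }
    }
}

// Repack every expert slice independently; a 2D tensor is just the ne[2] == 1 case.
inline ToyTensor repack(const ToyTensor& t) {
    assert(eligible_with_experts(t));
    ToyTensor out{{t.ne[0], t.ne[1], t.ne[2]}, std::vector<float>(t.data.size())};
    const int64_t slice = t.ne[0] * t.ne[1];  // elements per expert
    for (int64_t e = 0; e < t.ne[2]; ++e) {
        repack_slice(t.data.data() + e * slice, out.data.data() + e * slice, t.ne[0], t.ne[1]);
    }
    return out;
}
```

In this toy setup, a tensor with `ne = {2, 4, 2}` (2 columns, 4 rows, 2 experts) is rejected by `eligible_2d_only` but accepted by `eligible_with_experts`, and `repack` interleaves the rows of each expert without mixing data across experts.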

Repack also experts

7d020d8

ikawrakow merged commit 047ba89 into main Feb 19, 2025

This was referenced Feb 19, 2025

Q8_KV: 8-bit quantization type targeting the KV cache #208

Merged

Does the iqk_mul_mat.cpp support 1.58-bit quantization model? #209

Closed
