[x86] matmulnbit x64 kernel for 8bits#24491
Conversation
liqunfu
left a comment
There was a problem hiding this comment.
LGTM, it will be better if you add performance results, either mlas benchmark or even better ort-genai benchmark. ort-genai benchmark is preferred because I found mlas benchmark tends to show higher improvement that does not exist with genai.
|
"For non-vnni platform, the i16 cannot fit in 4 i8. To avoid overflow extra instructions are needed. This is the major reason of non-vnni slow down." does this issue exist with 4bit non-vnni? |
4bit does not have this issue. it is caused by (i8 * i8) * 2. the result is put in i16. (i4 * i4) * 2 can fit in i16 |
### Description Cherry pick the following into [rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0) - (#24487) - (#24466) - (#24493) - (#24484) - (#24494) - (#24489) - (#24504) - (#24510) - (#24456) - (#24537) - (#24501) - (#24519) - (#24513) - (#24539) - (#24514) - (#24542) - (#24585) Not added: Planning to cherry pick Cuda Matmulnbits PRs once the fix for failing cuda pipeline is ready - (#24491) - (#24509) - (#24564) --------- Co-authored-by: Adrian Lizarraga <[email protected]> Co-authored-by: minfhong-quic <[email protected]> Co-authored-by: minfhong-quic <[email protected]> Co-authored-by: Justin Chu <[email protected]> Co-authored-by: Prathik Rao <[email protected]> Co-authored-by: Edward Chen <[email protected]> Co-authored-by: Ankan Banerjee <[email protected]> Co-authored-by: Maximilian Müller <[email protected]> Co-authored-by: Gaurav Garg <[email protected]> Co-authored-by: iraut <[email protected]> Co-authored-by: Hrishikesh Manohar <[email protected]> Co-authored-by: Maximilian Müller <[email protected]> Co-authored-by: Scott McKay <[email protected]> Co-authored-by: Jiajia Qin <[email protected]> Co-authored-by: kunal-vaishnavi <[email protected]> Co-authored-by: xhcao <[email protected]>
### Description Cherry pick the following into [rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0) - (microsoft#24487) - (microsoft#24466) - (microsoft#24493) - (microsoft#24484) - (microsoft#24494) - (microsoft#24489) - (microsoft#24504) - (microsoft#24510) - (microsoft#24456) - (microsoft#24537) - (microsoft#24501) - (microsoft#24519) - (microsoft#24513) - (microsoft#24539) - (microsoft#24514) - (microsoft#24542) - (microsoft#24585) Not added: Planning to cherry pick Cuda Matmulnbits PRs once the fix for failing cuda pipeline is ready - (microsoft#24491) - (microsoft#24509) - (microsoft#24564) --------- Co-authored-by: vraspar <[email protected]> Co-authored-by: Adrian Lizarraga <[email protected]> Co-authored-by: minfhong-quic <[email protected]> Co-authored-by: minfhong-quic <[email protected]> Co-authored-by: Justin Chu <[email protected]> Co-authored-by: Prathik Rao <[email protected]> Co-authored-by: Edward Chen <[email protected]> Co-authored-by: Ankan Banerjee <[email protected]> Co-authored-by: Maximilian Müller <[email protected]> Co-authored-by: Gaurav Garg <[email protected]> Co-authored-by: iraut <[email protected]> Co-authored-by: Hrishikesh Manohar <[email protected]> Co-authored-by: Maximilian Müller <[email protected]> Co-authored-by: Scott McKay <[email protected]> Co-authored-by: Jiajia Qin <[email protected]> Co-authored-by: kunal-vaishnavi <[email protected]> Co-authored-by: xhcao <[email protected]>
### Description Add 8bits support for matmulnbits on x86 __AVX512 VNNI__ | M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slow down (8-bit / 4-bit) | |:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:| | 1 | 4096 | 4096 | 34145 | 27723 | **1.23×** | | 1 | 11008 | 4096 | 415285 | 68656 | **6.05×** | | 1 | 4096 | 11008 | 407801 | 68061 | **5.99×** | | 1 | 11008 | 11008 | 2674538 | 1003532 | **2.67×** | | 4096 | 4096 | 4096 | 80338759 | 86321713 | **0.93×** | | 4096 | 11008 | 4096 | 213421935 | 225245276 | **0.95×** | | 4096 | 4096 | 11008 | 240164365 | 228966953 | **1.05×** | | 4096 | 11008 | 11008 | 628352046 | 596738340 | **1.05×** | __AVX512__ | M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slow down (8-bit / 4-bit) | |:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:| | 1 | 4096 | 4096 | 53324 | 37882 | **1.41×** | | 1 | 11008 | 4096 | 244560 | 103255 | **2.37×** | | 1 | 4096 | 11008 | 435131 | 95734 | **4.55×** | | 1 | 11008 | 11008 | 2790710 | 1075216 | **2.60×** | | 4096 | 4096 | 4096 | 200629000 | 132841540 | **1.51×** | | 4096 | 11008 | 4096 | 532141914 | 350613184 | **1.52×** | | 4096 | 4096 | 11008 | 544011977 | 351679619 | **1.55×** | | 4096 | 11008 | 11008 | 1421865147 | 925593210 | **1.54×** | Token generation is bottlenecked at memory access. 8b model's 2x size is major reason of token generation slow down. For non-vnni platform, the i16 cannot fit in 4 i8. To avoid overflow extra instructions are needed. This is the major reason of non-vnni slow down. ### Motivation and Context MatMul4Bits model has repetition issue. 6b model resolved this issue.
### Description Cherry pick the following into [rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0) - (#24491) - (#24509) - (#24564) - (#24574) - (#24582) - (#24584) - (#24568) - (#24587) - (#24563) - (#24592) - (#24526) - (#24552) - (#24588) - (#24605) - (#24606) --------- Co-authored-by: Jing Fang <[email protected]> Co-authored-by: Tianlei Wu <[email protected]> Co-authored-by: Baiju Meswani <[email protected]> Co-authored-by: Scott McKay <[email protected]> Co-authored-by: Mark Schofield <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Edward Chen <[email protected]> Co-authored-by: Ashwath Shankarnarayan <[email protected]> Co-authored-by: saurabh <[email protected]> Co-authored-by: Adrian Lizarraga <[email protected]> Co-authored-by: Hector Li <[email protected]>
### Description Add 8bits support for matmulnbits on x86 __AVX512 VNNI__ | M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slow down (8-bit / 4-bit) | |:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:| | 1 | 4096 | 4096 | 34145 | 27723 | **1.23×** | | 1 | 11008 | 4096 | 415285 | 68656 | **6.05×** | | 1 | 4096 | 11008 | 407801 | 68061 | **5.99×** | | 1 | 11008 | 11008 | 2674538 | 1003532 | **2.67×** | | 4096 | 4096 | 4096 | 80338759 | 86321713 | **0.93×** | | 4096 | 11008 | 4096 | 213421935 | 225245276 | **0.95×** | | 4096 | 4096 | 11008 | 240164365 | 228966953 | **1.05×** | | 4096 | 11008 | 11008 | 628352046 | 596738340 | **1.05×** | __AVX512__ | M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slow down (8-bit / 4-bit) | |:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:| | 1 | 4096 | 4096 | 53324 | 37882 | **1.41×** | | 1 | 11008 | 4096 | 244560 | 103255 | **2.37×** | | 1 | 4096 | 11008 | 435131 | 95734 | **4.55×** | | 1 | 11008 | 11008 | 2790710 | 1075216 | **2.60×** | | 4096 | 4096 | 4096 | 200629000 | 132841540 | **1.51×** | | 4096 | 11008 | 4096 | 532141914 | 350613184 | **1.52×** | | 4096 | 4096 | 11008 | 544011977 | 351679619 | **1.55×** | | 4096 | 11008 | 11008 | 1421865147 | 925593210 | **1.54×** | Token generation is bottlenecked at memory access. 8b model's 2x size is major reason of token generation slow down. For non-vnni platform, the i16 cannot fit in 4 i8. To avoid overflow extra instructions are needed. This is the major reason of non-vnni slow down. ### Motivation and Context MatMul4Bits model has repetition issue. 6b model resolved this issue.
|
This PR has been included in the |
Description
Add 8bits support for matmulnbits on x86
AVX512 VNNI
AVX512
Token generation is bottlenecked at memory access. 8b model's 2x size is major reason of token generation slow down.
For non-vnni platform, the i16 cannot fit in 4 i8. To avoid overflow extra instructions are needed. This is the major reason of non-vnni slow down.
Motivation and Context
MatMul4Bits model has repetition issue. 6b model resolved this issue.