[MLAS] Fix Lut GEMM Flakiness and Accuracy #27216
Merged
vraspar requested changes on Jan 31, 2026
Contributor
Pull request overview
This PR fixes critical bugs in the MatMulNBitsLutGemm (T-MAC) operator that caused intermittent failures and numerical accuracy issues for multi-row activations. The fixes address scale indexing errors, race conditions in parallel processing, and buffer allocation issues.
Changes:
- Fixed incorrect LUT scale indexing from `kk / (ActK * 4)` to `kk / ActK` in AVX2 kernel
- Serialized activation loop to eliminate race conditions in multi-row processing
- Added tail handling for matrices where dimensions are not multiples of 32
- Corrected buffer size calculations and added explicit zero-initialization
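The tail handling listed above can be pictured with a small scalar sketch. This is not the AVX2 kernel from the PR; `kBlock` and `BlockedSum` are hypothetical names, and the only point is to show how the last, partial block is clamped when the length is not a multiple of 32.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block width; the PR text only says "multiples of 32".
constexpr std::size_t kBlock = 32;

// Sums `data` one block at a time. The final block is clamped to the
// remaining element count, so no read goes past the end of the buffer.
// `blocks_visited` reports how many blocks (full or partial) were processed.
float BlockedSum(const std::vector<float>& data, std::size_t* blocks_visited) {
    assert(blocks_visited != nullptr);
    float total = 0.0f;
    *blocks_visited = 0;
    for (std::size_t start = 0; start < data.size(); start += kBlock) {
        // Clamp: a full block, or whatever is left in the tail.
        const std::size_t len = std::min(kBlock, data.size() - start);
        for (std::size_t i = 0; i < len; ++i) {
            total += data[start + i];
        }
        ++*blocks_visited;
    }
    return total;
}
```

With 70 elements, the loop visits two full blocks of 32 and one clamped tail block of 6.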
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/test/contrib_ops/matmul_2bits_test.cc | Increased tolerance to 1.0f for Batch32 asymmetric test to account for T-MAC's lossy quantization |
| onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp | Fixed scale indexing bug, added tail case handling, and added explicit buffer initialization |
| onnxruntime/core/mlas/lib/qlutgemm.cpp | Corrected buffer size calculation, serialized activation loop, and added explicit zero-initialization |
b82fa9e to ecc7081
ecc7081 to 2d0cc15
Contributor
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.
vraspar approved these changes on Feb 10, 2026
tianleiwu added a commit that referenced this pull request on Feb 12, 2026
This PR resolves flakiness and accuracy issues in the `MatMulNBitsLutGemm` operator.

## Root Cause Analysis

The `MatMulNBitsLutGemm` operator exhibited non-deterministic flakiness and numerical accuracy issues. This analysis covers the root causes addressed by the changes.

## Identified Root Causes

### 1. Data Race in [LutGemmPackQuantBData](https://github.com/microsoft/onnxruntime/blob/cee825d34d533ca325bfd8f8269c86133ae512e6/onnxruntime/core/mlas/lib/qlutgemm.cpp#L166-L295)

- **Issue**: The weight packing loop was parallelized across output features ($N$). Since T-MAC packs multiple features into a single byte, concurrent updates to the same byte caused bit-level corruption.
- **Fix**: Serialized the sub-byte accumulation phase of the weight packing process.

### 2. Thread-Safety in Global Configuration Map

- **Issue**: `tmac_kernel_configs` (a static `std::unordered_map`) was accessed concurrently. Map insertions or rehashing during initialization could invalidate references held by other threads.
- **Fix**: Added `std::mutex` protection and modified the parameter getter to return by value.

### 3. Tiling Dimension Mismatch and Buffer Safety

- **Issue**: The orchestrator used batch size ($M$) for kernel configuration, while weights are tiled by features ($N$). Additionally, the kernel lacked clamping for partial tiles, leading to potential overruns.
- **Fix**: Synchronized tiling logic by using $N$ for initialization, passing `TotalN` for parameter retrieval, and implementing explicit clamping and tail-case handling in the AVX2 kernel.

### Verification Results

- `MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256` passed 100 consecutive iterations.
- Full MatMul2Bits suite passed all 10 tests with standard **0.15f** tolerance.
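Root causes 1 and 2 in the commit message above can be illustrated with a minimal sketch. It makes several assumptions: the packing layout (four 2-bit values per byte, low bits first) is a simplification and not the actual T-MAC layout, and `PackTwoBitSerial`, `KernelParams`, and `KernelConfigRegistry` are hypothetical names, not the MLAS API. The first function keeps the sub-byte read-modify-write serial; the class guards a map with a `std::mutex` and returns parameters by value so callers never hold a reference into the container.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// Packs 2-bit values, four per byte (assumed layout, low bits first).
// Neighbouring values share a byte, so the OR-accumulation is a
// read-modify-write on shared storage and is kept in a single thread here.
std::vector<std::uint8_t> PackTwoBitSerial(const std::vector<std::uint8_t>& values) {
    // Zero-initialised so that OR-ing bits in starts from a known state.
    std::vector<std::uint8_t> packed((values.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(values[i] < 4);  // each value must fit in 2 bits
        packed[i / 4] |= static_cast<std::uint8_t>(values[i] << (2 * (i % 4)));
    }
    return packed;
}

// Hypothetical kernel parameters; the real struct in MLAS may differ.
struct KernelParams {
    std::size_t bm;
    std::size_t kfactor;
};

// A mutex-guarded configuration map. Unsynchronised concurrent insert and
// lookup on std::unordered_map is a data race, so every access takes the lock.
class KernelConfigRegistry {
public:
    void Set(std::size_t key, KernelParams params) {
        std::lock_guard<std::mutex> lock(mutex_);
        configs_[key] = params;
    }

    // Returns a copy, so the caller's result stays valid after the lock is
    // released. Unknown keys yield zeroed parameters in this sketch.
    KernelParams Get(std::size_t key) const {
        std::lock_guard<std::mutex> lock(mutex_);
        const auto it = configs_.find(key);
        return it == configs_.end() ? KernelParams{} : it->second;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<std::size_t, KernelParams> configs_;
};
```

Returning by value costs a small copy per lookup, but it removes any dependence on the lifetime of the map entry once the lock is dropped.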
tianleiwu added a commit that referenced this pull request on Feb 13, 2026
This cherry-picks the following commits for the 1.24.2 release:

- #27096
- #27077
- #26677
- #27238
- #27213
- #27256
- #27278
- #27275
- #27276
- #27216
- #27271
- #27299
- #27294
- #27266
- #27176
- #27126
- #27252

---------

Co-authored-by: Xiaofei Han <[email protected]>
Co-authored-by: Jiajia Qin <[email protected]>
Co-authored-by: Yulong Wang <[email protected]>
Co-authored-by: qti-monumeen <[email protected]>
Co-authored-by: Ankit Maheshkar <[email protected]>
Co-authored-by: Eric Crawford <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: guschmue <[email protected]>
Co-authored-by: Guenther Schmuelling <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: angelser <[email protected]>
Co-authored-by: Angela Serrano Brummett <[email protected]>
Co-authored-by: Misha Chornyi <[email protected]>
Co-authored-by: hariharans29 <[email protected]>
Co-authored-by: eserscor <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Baiju Meswani <[email protected]>
Co-authored-by: Adrian Lizarraga <[email protected]>
Co-authored-by: Ti-Tai Wang <[email protected]>
Co-authored-by: bmehta001 <[email protected]>
This was referenced Feb 23, 2026
deps(nuget): Bump the microsoft-packages group with 2 updates
Ellerbach/azure-ai-search-simulator#53
Merged