[MLAS] Fix Lut GEMM Flakiness and Accuracy by tianleiwu · Pull Request #27216 · microsoft/onnxruntime

tianleiwu · 2026-01-30T22:21:44Z

This PR resolves flakiness and accuracy issues in the MatMulNBitsLutGemm operator.

Root Cause Analysis

The MatMulNBitsLutGemm operator failed intermittently and produced numerically inaccurate results. This analysis covers the root causes addressed by the changes.

Identified Root Causes

1. Data Race in LutGemmPackQuantBData

Issue: The weight packing loop was parallelized across output features ($N$). Since T-MAC packs multiple features into a single byte, concurrent updates to the same byte caused bit-level corruption.
Fix: Serialized the sub-byte accumulation phase of the weight packing process.
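
To illustrate why the accumulation phase has to be serial, here is a minimal sketch. It assumes a simplified layout in which four 2-bit weights share one byte; this is not the actual T-MAC layout, and `PackTwoBitSerial` is a hypothetical name rather than a function in `qlutgemm.cpp`. The `|=` on a shared byte is a non-atomic read-modify-write, so splitting the loop over `n` across threads would let two writers clobber each other's bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout for illustration only: weight n occupies bits
// [2*(n%4), 2*(n%4)+1] of byte n/4. The real T-MAC packing differs.
std::vector<uint8_t> PackTwoBitSerial(const std::vector<uint8_t>& weights) {
    std::vector<uint8_t> packed((weights.size() + 3) / 4, 0);
    // One thread performs every read-modify-write on `packed`, so updates
    // to the same byte can never interleave.
    for (std::size_t n = 0; n < weights.size(); ++n) {
        assert(weights[n] < 4);  // each value must fit in 2 bits
        packed[n / 4] |= static_cast<uint8_t>(weights[n] << (2 * (n % 4)));
    }
    return packed;
}
```

An alternative that keeps parallelism would be to partition the work by output byte instead of by feature, so each byte has exactly one writer. The sketch above only shows the serialized form described in the fix.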

2. Thread-Safety in Global Configuration Map

Issue: tmac_kernel_configs (a static std::unordered_map) was accessed concurrently. Map insertions or rehashing during initialization could invalidate references held by other threads.
Fix: Added std::mutex protection and modified the parameter getter to return by value.
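
The pattern described in the fix can be sketched as follows. This is an assumption-laden illustration, not the code in the PR: `KernelConfigRegistry`, `KernelParams`, and the field names `bm` and `kfactor` are hypothetical, and the key type is chosen only for brevity. Every access to the map takes the same mutex, and the getter copies the entry while the lock is held, so callers never hold a reference into the map after the lock is released.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the kernel parameters; field names are illustrative.
struct KernelParams {
    int bm;
    int kfactor;
};

class KernelConfigRegistry {
public:
    // Inserts or overwrites an entry while holding the lock.
    void Set(const std::string& key, const KernelParams& params) {
        std::lock_guard<std::mutex> lock(mutex_);
        configs_[key] = params;
    }

    // Returns a copy made under the lock, not a reference into the map.
    KernelParams Get(const std::string& key) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = configs_.find(key);
        assert(it != configs_.end());  // sketch: the caller must Set() first
        return it->second;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<std::string, KernelParams> configs_;
};
```

Returning by value costs one small copy per lookup, which is usually negligible next to the GEMM itself.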

3. Tiling Dimension Mismatch and Buffer Safety

Issue: The orchestrator used batch size ($M$) for kernel configuration, while weights are tiled by features ($N$). Additionally, the kernel lacked clamping for partial tiles, leading to potential overruns.
Fix: Synchronized tiling logic by using $N$ for initialization, passing TotalN for parameter retrieval, and implementing explicit clamping and tail-case handling in the AVX2 kernel.
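
The clamping idea can be sketched independently of the AVX2 kernel. The helper below is hypothetical (`TileRanges` does not appear in the PR) and assumes tiles are laid out contiguously along $N$. The end of each tile is clamped to the total, so the last, partial tile never reads or writes past the buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits [0, total_n) into half-open [begin, end) tiles of width tile_n.
// The final tile is clamped to total_n when total_n is not a multiple of tile_n.
std::vector<std::pair<std::size_t, std::size_t>> TileRanges(std::size_t total_n,
                                                            std::size_t tile_n) {
    assert(tile_n > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < total_n; begin += tile_n) {
        const std::size_t end = std::min(begin + tile_n, total_n);  // clamp the tail
        ranges.emplace_back(begin, end);
    }
    return ranges;
}
```

For example, `TileRanges(70, 32)` yields the tiles [0, 32), [32, 64), and [64, 70). Without the `std::min`, the third tile would extend to 96 and overrun a 70-element buffer.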

Verification Results

MatMulNBitsLutGemm.Float32_2Bits_Asymmetric_Batch32_256x256 passed 100 consecutive iterations.
Full MatMul2Bits suite passed all 10 tests with standard 0.15f tolerance.

onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp

onnxruntime/core/mlas/lib/qlutgemm.cpp

onnxruntime/test/contrib_ops/matmul_2bits_test.cc

Copilot

Pull request overview

This PR fixes critical bugs in the MatMulNBitsLutGemm (T-MAC) operator that caused intermittent failures and numerical accuracy issues for multi-row activations. The fixes address scale indexing errors, race conditions in parallel processing, and buffer allocation issues.

Changes:

Fixed incorrect LUT scale indexing from kk / (ActK * 4) to kk / ActK in AVX2 kernel
Serialized activation loop to eliminate race conditions in multi-row processing
Added tail handling for matrices where dimensions are not multiples of 32
Corrected buffer size calculations and added explicit zero-initialization

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
onnxruntime/test/contrib_ops/matmul_2bits_test.cc	Increased tolerance to 1.0f for Batch32 asymmetric test to account for T-MAC's lossy quantization
onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp	Fixed scale indexing bug, added tail case handling, and added explicit buffer initialization
onnxruntime/core/mlas/lib/qlutgemm.cpp	Corrected buffer size calculation, serialized activation loop, and added explicit zero-initialization



Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.



This cherry-picks the following commits for the 1.24.2 release:

#27096, #27077, #26677, #27238, #27213, #27256, #27278, #27275, #27276, #27216, #27271, #27299, #27294, #27266, #27176, #27126, #27252

Co-authored-by: Xiaofei Han
Co-authored-by: Jiajia Qin
Co-authored-by: Yulong Wang
Co-authored-by: qti-monumeen
Co-authored-by: Ankit Maheshkar
Co-authored-by: Eric Crawford
Co-authored-by: Copilot
Co-authored-by: guschmue
Co-authored-by: Guenther Schmuelling
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: angelser
Co-authored-by: Angela Serrano Brummett
Co-authored-by: Misha Chornyi
Co-authored-by: hariharans29
Co-authored-by: eserscor
Co-authored-by: Copilot
Co-authored-by: Baiju Meswani
Co-authored-by: Adrian Lizarraga
Co-authored-by: Ti-Tai Wang
Co-authored-by: bmehta001

tianleiwu requested a review from vraspar January 31, 2026 00:28

vraspar requested changes Jan 31, 2026

onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp (outdated, resolved)

onnxruntime/core/mlas/lib/qlutgemm.cpp (resolved)

onnxruntime/test/contrib_ops/matmul_2bits_test.cc (outdated, resolved)

tianleiwu requested a review from vraspar January 31, 2026 01:15

vraspar requested a review from Copilot January 31, 2026 01:28

Copilot started reviewing on behalf of vraspar January 31, 2026 01:28

Copilot AI reviewed Jan 31, 2026

onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp (outdated, resolved)

onnxruntime/test/contrib_ops/matmul_2bits_test.cc (outdated, resolved)

onnxruntime/core/mlas/lib/sqnbitgemm_lut_kernel_avx2.cpp (outdated, resolved)

tianleiwu force-pushed the tlwu/fix_lut_gemm branch 2 times, most recently from b82fa9e to ecc7081 Compare January 31, 2026 05:40

tianleiwu changed the title ~~[MLAS] Fix Lut GEMM~~ [MLAS] Fix Lut GEMM Flakiness and Accuracy Jan 31, 2026

Fix Lut GEMM Flakiness and Accuracy

2d0cc15

tianleiwu force-pushed the tlwu/fix_lut_gemm branch from ecc7081 to 2d0cc15 Compare January 31, 2026 06:07

tianleiwu requested a review from Copilot January 31, 2026 06:07

Copilot started reviewing on behalf of tianleiwu January 31, 2026 06:08

Copilot AI reviewed Jan 31, 2026


tianleiwu added the release:1.24.2 label Feb 4, 2026

vraspar approved these changes Feb 10, 2026


tianleiwu merged commit 9a4f463 into main Feb 11, 2026
124 of 159 checks passed

tianleiwu deleted the tlwu/fix_lut_gemm branch February 11, 2026 02:04

tianleiwu mentioned this pull request Feb 12, 2026

ORT 1.24.2 release cherry pick round 1 #27330

Merged

tianleiwu removed the release:1.24.2 label Feb 12, 2026
