[MLAS] Fix rotary avx2 kernel invalid access#26389
Merged
Conversation
Contributor
Pull Request Overview
This PR fixes a critical memory access violation in the AVX2 rotary embedding kernel. The issue stems from the use of _mm256_maskload_ps and _mm256_maskstore_ps intrinsics, which can read/write beyond the masked elements, potentially causing segmentation faults when buffers are near page boundaries.
Key Changes:
- Removed masked AVX2 remainder handling logic that could cause invalid memory access
- Replaced with safe scalar loops for processing trailing elements (1-15 elements)
Co-authored-by: Copilot <[email protected]>
titaiwangms approved these changes on Oct 24, 2025.
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request on Nov 2, 2025.
This fixes an invalid memory access caused by the _mm256_maskload_ps intrinsic used in the remainder-handling logic introduced in #23694.
The core of the problem is that _mm256_maskload_ps (and its store counterpart, _mm256_maskstore_ps) can touch memory beyond the masked elements.
Even if the mask correctly specifies that you only want to load, for example, 3 floats, the intrinsic may still read the full 32 bytes (8 floats) from the provided address.
The invalid access occurs when one of the buffers (input, sin_data, or cos_data) ends near a page boundary and the masked-off part of the 32-byte read falls onto an unmapped page, causing a segmentation fault.
The Solution: Use a Scalar Remainder Loop
The simplest, safest, and most robust solution is to replace the masked AVX remainder logic with a plain scalar loop. This is the same strategy already used by the RopeKernel_Avx2_fp16_Impl functions, which are not affected by this bug.
The performance impact of this change will be negligible, as this loop only processes the final 1-15 elements.