[MLAS] Fix rotary avx2 kernel invalid access#26389
Merged
Conversation
Contributor
Pull Request Overview
This PR fixes a critical memory access violation in the AVX2 rotary embedding kernel. The issue stems from the use of _mm256_maskload_ps and _mm256_maskstore_ps intrinsics, which can read/write beyond the masked elements, potentially causing segmentation faults when buffers are near page boundaries.
Key Changes:
- Removed masked AVX2 remainder handling logic that could cause invalid memory access
- Replaced with safe scalar loops for processing trailing elements (1-15 elements)
Co-authored-by: Copilot <[email protected]>
titaiwangms approved these changes on Oct 24, 2025.
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request on Nov 2, 2025.
This fixes an invalid memory access caused by the _mm256_maskload_ps intrinsic used in the remainder-handling logic introduced in #23694.
The core of the problem is that _mm256_maskload_ps (and its store counterpart, _mm256_maskstore_ps) can touch memory beyond the masked elements.
Even if the mask correctly specifies that you only want to load, for example, 3 floats, the intrinsic may still read the full 32 bytes (8 floats) from the provided address.
The invalid access occurs when one of the buffers (input, sin_data, or cos_data) ends near a page boundary and the masked-off part of the 32-byte read falls onto an unmapped page, causing a segmentation fault.
The Solution: Use a Scalar Remainder Loop
The simplest, safest, and most robust solution is to replace the masked AVX remainder logic with a plain scalar loop. This is the same strategy already used by the RopeKernel_Avx2_fp16_Impl functions, which are not affected by this bug.
The performance impact of this change will be negligible, as this loop only processes the final 1-15 elements.