fix: correct attention temperature scaling formula for Mistral3/Devstral #1
Summary
This PR fixes an off-by-one error in the attention temperature scaling formula used by Mistral3 and Devstral models.
Problem
The attention temperature scaling formula in `llama-graph.cpp` contained an off-by-one in its position term, so the scale it computed did not match the correct formula from the PyTorch reference implementation.
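For illustration only, a minimal sketch of the two variants, assuming a Llama-4-style temperature formula `log(floor((pos + 1) / floor_scale) + 1) * scale + 1`; the function names, the formula itself, and the idea that the bug was a missing `+ 1` on the position are assumptions here, not a quote of the actual `llama-graph.cpp` code:

```cpp
#include <cmath>

// Illustrative only: an assumed Llama-4-style attention temperature
// scale. The real Mistral3/Devstral expression may differ.

// Off-by-one variant: the position enters floor() as-is.
float attn_scale_buggy(int pos, int n_orig_ctx, float f_scale) {
    return logf(floorf((float) pos / n_orig_ctx) + 1.0f) * f_scale + 1.0f;
}

// Corrected variant: the position is offset by one, matching a
// 1-based position index inside the reference formula.
float attn_scale_fixed(int pos, int n_orig_ctx, float f_scale) {
    return logf(floorf((float) (pos + 1) / n_orig_ctx) + 1.0f) * f_scale + 1.0f;
}
```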
Impact
This caused divergence from the reference implementation for positions >= `original_max_position_embeddings - 1`. For example, with `original_max_position_embeddings = 8192`, the two formulas first disagree at position 8191, as the check below illustrates.
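Under the same assumed formula as the sketch above, a standalone check that the first disagreement lands exactly at position `original_max_position_embeddings - 1` (the coefficient `0.1f` is a made-up value, not the model's actual scale):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const int   n_orig  = 8192; // original_max_position_embeddings
    const float f_scale = 0.1f; // assumed temperature coefficient

    const int positions[] = {8190, 8191};
    for (int pos : positions) {
        // buggy: position used as-is; fixed: position offset by one
        const float buggy = logf(floorf((float) pos       / n_orig) + 1.0f) * f_scale + 1.0f;
        const float fixed = logf(floorf((float) (pos + 1) / n_orig) + 1.0f) * f_scale + 1.0f;
        printf("pos=%d  buggy=%.6f  fixed=%.6f\n", pos, buggy, fixed);
    }
    // prints: pos=8190  buggy=1.000000  fixed=1.000000
    //         pos=8191  buggy=1.000000  fixed=1.069315
    return 0;
}
```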
Fix
Changed the formula to match the PyTorch reference implementation exactly.
Related Issue
Fixes: ggml-org#17980