fix-mistral3-attn-temp-scaling #18018
Conversation
Fix an off-by-one error in the attention temperature scaling formula used by the Mistral3 and Devstral models.

The formula was computing:

`log(floor((pos + 1) / max_position_embeddings) + 1) * beta + 1`

but the correct formula (matching the PyTorch reference implementation) is:

`1 + beta * log(1 + floor(pos / max_position_embeddings))`

This caused divergence from the reference implementation for positions >= original_max_position_embeddings - 1 (e.g., position 8191 for models with original_max_position_embeddings = 8192).

Fixes: ggml-org#17980

Co-authored-by: openhands <[email protected]>
fix: correct attention temperature scaling formula for Mistral3/Devstral
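For illustration, here is a minimal, self-contained C++ sketch of the two formulas described above. The names (`attn_temp_old`, `attn_temp_new`, `n_orig_ctx`, `beta`) are made up for this example and are not the identifiers used in llama.cpp, and `beta = 0.1f` is only a placeholder value.

```cpp
#include <cmath>
#include <cstdio>

// Hedged sketch of the two formulas discussed in this PR; names are illustrative.
static float attn_temp_old(int pos, int n_orig_ctx, float beta) {
    // buggy variant: off by one in the position term
    return logf(floorf((float)(pos + 1) / n_orig_ctx) + 1.0f) * beta + 1.0f;
}

static float attn_temp_new(int pos, int n_orig_ctx, float beta) {
    // corrected variant: 1 + beta * log(1 + floor(pos / n_orig_ctx))
    return 1.0f + beta * logf(1.0f + floorf((float)pos / n_orig_ctx));
}

int main() {
    const int   n_orig_ctx = 8192;
    const float beta       = 0.1f; // placeholder, not the model's actual setting

    for (int pos : {0, 8190, 8191, 8192, 16383, 16384}) {
        printf("pos=%6d  old=%.6f  new=%.6f\n",
               pos, attn_temp_old(pos, n_orig_ctx, beta), attn_temp_new(pos, n_orig_ctx, beta));
    }
    return 0;
}
```

Since the only difference is `floor((pos + 1) / N)` versus `floor(pos / N)`, the two formulas first disagree at `pos = n_orig_ctx - 1` (8191 in the sketch above), matching the position called out in the PR description.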
We already acknowledged this difference before the model was released, but obviously it won't make such a big difference since, as you said, the temp stays the same. So unless there are perplexity results to prove that this improves performance, I don't think it's good to merge, as it will also affect the Llama 4 model.
@ngxson I think this is a legitimate fix for the Llama 4 temperature scaling. Even the reference vllm implementation computes it like this:

Note that I don't think this temperature fix has anything to do with #17980, which discusses a possible divergence. At the moment, I don't think there is any divergence, since the data in #17980 are from an old version, before the YaRN fix in #17945. So, unrelated to #17980, I think we should include this change. Note that PPL will not show anything because this temperature gets activated at much larger positions.
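Regarding the per-position scale referenced in the comment above, the following is a hedged, illustrative sketch of one common way such a scale is applied: multiplying the query vectors before the QK dot product. It is not the actual vllm or llama.cpp code; the names and layout are assumptions made for this example.

```cpp
#include <cmath>
#include <vector>

// Illustrative only: scale each token's query by the per-position attention
// temperature before the Q·K dot product. Names and layout are assumptions,
// not llama.cpp or vllm identifiers.
static void apply_attn_temp(std::vector<float> & q, // flattened [n_tokens, head_dim] queries
                            int n_tokens, int head_dim,
                            int n_orig_ctx, float beta,
                            int pos0 /* position of the first token in the batch */) {
    for (int t = 0; t < n_tokens; ++t) {
        const int pos = pos0 + t;
        // corrected formula from this PR: 1 + beta * log(1 + floor(pos / n_orig_ctx))
        const float scale = 1.0f + beta * logf(1.0f + floorf((float)pos / n_orig_ctx));
        for (int d = 0; d < head_dim; ++d) {
            q[(size_t)t*head_dim + d] *= scale;
        }
    }
}
```

Because the scale stays at exactly 1.0 until `pos` reaches `n_orig_ctx`, a perplexity run over short contexts would be unaffected, which is consistent with the remark above that PPL will not show anything.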
@ggerganov Hmm, ok, though I still think it won't make a very big difference even for high-position tokens. This PR only misses support for the Llama 4 model; I'll open another PR for it.
Agreed that this effect is going to be very tiny. I think it's more about having code consistent with the rest of the engines than about expecting to change the quality significantly.
Let's do both versions together in a single PR?
Yes, but I cannot push to this PR because it was created from
@nlasky2000-dot Thanks for reporting this. We'll push a separate PR to cover Llama 4 as well.
Note: While this fix corrects the attention temperature scaling formula, the issue description mentions divergence happening for the first 6 tokens where both formulas would give 1.0. There may be additional factors contributing to the divergence (such as floating-point precision differences or other implementation details), but this fix addresses the clear bug in the formula that would cause significant divergence for longer sequences.