Conversation

@nlasky2000-dot

Note: While this fix corrects the attention temperature scaling formula, the issue description mentions divergence happening for the first 6 tokens, where both formulas would give 1.0. There may be additional factors contributing to the divergence (such as floating-point precision differences or other implementation details), but this fix addresses a clear bug in the formula that would cause significant divergence for longer sequences.
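
As a quick sanity check of that claim, here is a minimal standalone sketch (not llama.cpp code; beta = 0.1 and a floor scale of 8192 are assumed example values, not read from any model) showing that both the old and the corrected formulas reduce to exactly 1.0 for the first few positions:

```cpp
// Hypothetical sanity check, not part of llama.cpp: for small positions both
// formulas evaluate to 1.0, so the early-token divergence described in the
// issue cannot come from this bug alone.
#include <cmath>
#include <cstdio>

int main() {
    const float beta        = 0.1f;    // assumed attention temperature scale
    const float floor_scale = 8192.0f; // assumed original_max_position_embeddings

    for (int pos = 0; pos < 6; ++pos) {
        const float old_t = std::log(std::floor((pos + 1.0f) / floor_scale) + 1.0f) * beta + 1.0f;
        const float new_t = 1.0f + beta * std::log(1.0f + std::floor(pos / floor_scale));
        std::printf("pos=%d old=%.3f new=%.3f\n", pos, old_t, new_t); // both print 1.000
    }
    return 0;
}
```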

openhands-agent and others added 2 commits December 14, 2025 05:58
Fix off-by-one error in the attention temperature scaling formula used by
Mistral3 and Devstral models. The formula was computing:

  log(floor((pos + 1) / max_position_embeddings) + 1) * beta + 1

But the correct formula (matching the PyTorch reference implementation) is:

  1 + beta * log(1 + floor(pos / max_position_embeddings))

This caused divergence from the reference implementation for positions
>= original_max_position_embeddings - 1 (e.g., position 8191 for models
with original_max_position_embeddings=8192).

Fixes: ggml-org#17980

Co-authored-by: openhands <[email protected]>
fix: correct attention temperature scaling formula for Mistral3/Devstral
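
To make the off-by-one concrete, here is a minimal sketch (again with assumed example values beta = 0.1 and floor_scale = 8192, not the actual model hyperparameters) that evaluates both formulas around the boundary; the old formula bumps the temperature one position too early:

```cpp
// Hypothetical illustration of the off-by-one, not llama.cpp code.
// The old formula raises the temperature at pos = 8191, the corrected
// formula at pos = 8192, matching the reference implementation.
#include <cmath>
#include <cstdio>

int main() {
    const float beta        = 0.1f;    // assumed attention temperature scale
    const float floor_scale = 8192.0f; // assumed original_max_position_embeddings

    const int positions[] = {8190, 8191, 8192, 16383, 16384};
    for (int pos : positions) {
        const float old_t = std::log(std::floor((pos + 1.0f) / floor_scale) + 1.0f) * beta + 1.0f;
        const float new_t = 1.0f + beta * std::log(1.0f + std::floor(pos / floor_scale));
        std::printf("pos=%5d  old=%.4f  new=%.4f\n", pos, old_t, new_t);
    }
    // old: 1.0000  1.0693  1.0693  1.1099  1.1099
    // new: 1.0000  1.0000  1.0693  1.0693  1.1099
    return 0;
}
```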
@ngxson
Collaborator

ngxson commented Dec 14, 2025

We already acknowledged this difference before the model was released, but obviously it won't make such a big difference: as you said, the temp stays 1.0 until it passes n_attn_temp_floor_scale (which is 8192), yet most tests suggest that the divergence happens before this number.

So unless there are perplexity results proving that this improves performance, I don't think it's good to merge, as it will also affect the llama 4 model.

@ggerganov
Member

@ngxson I think this is a legitimate fix for llama 4 temperature scaling. Even the reference vllm implementation computes it like this:

https://github.com/vllm-project/vllm/pull/28145/files#diff-48d2ca5476d5b776f6401436fcf015c5ce4dc1a23d2b78a09e08fb85acc3697cR232-R242

Note that I don't think this temperature fix has anything to do with #17980, which discusses a possible divergence. Atm, I don't think there is any divergence, since the data in #17980 are from an old version, before the YaRN fix in #17945.

So unrelated to #17980, I think we should include this change. Note that PPL will not show anything because this temperature gets activated at much larger positions.

@ngxson
Collaborator

ngxson commented Dec 14, 2025

@ggerganov hmm ok, though I still think it won't make a very big difference, even for high-position tokens.

This PR only misses support for the llama 4 model; I'll open another PR for it.

@ggerganov
Member

Agree that this effect is going to be very tiny - I think it's more about keeping the code consistent with the rest of the engines than about expecting it to change the quality significantly.

This PR only misses support for the llama 4 model; I'll open another PR for it.

Let's do both versions together in a single PR?

@ngxson
Collaborator

ngxson commented Dec 14, 2025

Yes, but I cannot push to this PR because it was created from the master branch of the forked repo; I'll need to push a new one.

@ggerganov
Member

@nlasky2000-dot Thanks for reporting this. We'll push a separate PR to cover llama4 as well.
