fix-mistral3-attn-temp-scaling #18018
Conversation
Fix an off-by-one error in the attention temperature scaling formula used by the Mistral3 and Devstral models.

The formula was computing:

`log(floor((pos + 1) / max_position_embeddings) + 1) * beta + 1`

but the correct formula (matching the PyTorch reference implementation) is:

`1 + beta * log(1 + floor(pos / max_position_embeddings))`

This caused divergence from the reference implementation for positions >= original_max_position_embeddings - 1 (e.g., position 8191 for models with original_max_position_embeddings = 8192).

Fixes: ggml-org#17980

Co-authored-by: openhands <[email protected]>
fix: correct attention temperature scaling formula for Mistral3/Devstral
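For illustration, here is a minimal, self-contained C++ sketch of the two formulas described above. The names (`attn_temp_old`, `attn_temp_new`, `n_orig_ctx`, `beta`) are made up for this example and are not the identifiers used in llama.cpp, and `beta = 0.1f` is only a placeholder value.

```cpp
#include <cmath>
#include <cstdio>

// Hedged sketch of the two formulas discussed in this PR; names are illustrative.
static float attn_temp_old(int pos, int n_orig_ctx, float beta) {
    // buggy variant: off by one in the position term
    return logf(floorf((float)(pos + 1) / n_orig_ctx) + 1.0f) * beta + 1.0f;
}

static float attn_temp_new(int pos, int n_orig_ctx, float beta) {
    // corrected variant: 1 + beta * log(1 + floor(pos / n_orig_ctx))
    return 1.0f + beta * logf(1.0f + floorf((float)pos / n_orig_ctx));
}

int main() {
    const int   n_orig_ctx = 8192;
    const float beta       = 0.1f; // placeholder, not the model's actual setting

    for (int pos : {0, 8190, 8191, 8192, 16383, 16384}) {
        printf("pos=%6d  old=%.6f  new=%.6f\n",
               pos, attn_temp_old(pos, n_orig_ctx, beta), attn_temp_new(pos, n_orig_ctx, beta));
    }
    return 0;
}
```

Since the only difference is `floor((pos + 1) / N)` versus `floor(pos / N)`, the two formulas first disagree at `pos = n_orig_ctx - 1` (8191 in the sketch above), matching the position called out in the PR description.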
We already acknowledged this difference before the model was released, but obviously it won't make such a big difference since, as you said, the temp stays the same. So unless there are perplexity results to prove that this improves performance, I don't think it's good to merge, as it will also affect the Llama 4 model.
@ngxson I think this is a legitimate fix for the Llama 4 temperature scaling. Even the reference vllm implementation computes it like this:

Note that I don't think this temperature fix has anything to do with #17980, which discusses a possible divergence. At the moment, I don't think there is any divergence, since the data in #17980 are from an old version, before the YaRN fix in #17945. So, unrelated to #17980, I think we should include this change. Note that PPL will not show anything because this temperature gets activated at much larger positions.
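Regarding the per-position scale referenced in the comment above, the following is a hedged, illustrative sketch of one common way such a scale is applied: multiplying the query vectors before the QK dot product. It is not the actual vllm or llama.cpp code; the names and layout are assumptions made for this example.

```cpp
#include <cmath>
#include <vector>

// Illustrative only: scale each token's query by the per-position attention
// temperature before the Q·K dot product. Names and layout are assumptions,
// not llama.cpp or vllm identifiers.
static void apply_attn_temp(std::vector<float> & q, // flattened [n_tokens, head_dim] queries
                            int n_tokens, int head_dim,
                            int n_orig_ctx, float beta,
                            int pos0 /* position of the first token in the batch */) {
    for (int t = 0; t < n_tokens; ++t) {
        const int pos = pos0 + t;
        // corrected formula from this PR: 1 + beta * log(1 + floor(pos / n_orig_ctx))
        const float scale = 1.0f + beta * logf(1.0f + floorf((float)pos / n_orig_ctx));
        for (int d = 0; d < head_dim; ++d) {
            q[(size_t)t*head_dim + d] *= scale;
        }
    }
}
```

Because the scale stays at exactly 1.0 until `pos` reaches `n_orig_ctx`, a perplexity run over short contexts would be unaffected, which is consistent with the remark above that PPL will not show anything.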
@ggerganov Hmm, ok, though I still think it won't make a very big difference even for high-position tokens. This PR only misses support for the Llama 4 model; I'll open another PR for it.
Agreed that this effect is going to be very tiny. I think it's more about having code consistent with the rest of the engines than about expecting to change the quality significantly.
Let's do both versions together in a single PR?
Yes, but I cannot push to this PR because it was created from
@nlasky2000-dot Thanks for reporting this. We'll push a separate PR to cover Llama 4 as well.
Note: While this fix corrects the attention temperature scaling formula, the issue description mentions divergence happening for the first 6 tokens where both formulas would give 1.0. There may be additional factors contributing to the divergence (such as floating-point precision differences or other implementation details), but this fix addresses the clear bug in the formula that would cause significant divergence for longer sequences.