@nlasky2000-dot (Owner)

Summary

This PR fixes an off-by-one error in the attention temperature scaling formula used by Mistral3 and Devstral models.

Problem

The attention temperature scaling formula in llama-graph.cpp was computing:

log(floor((pos + 1) / max_position_embeddings) + 1) * beta + 1

But the correct formula (matching the PyTorch reference implementation) is:

1 + beta * log(1 + floor(pos / max_position_embeddings))
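
The difference is only whether the position is incremented before the floor division. As a minimal standalone sketch of the two variants (illustrative helper names, not the actual llama-graph.cpp code):

```cpp
#include <cmath>

// Old (incorrect) scaling: 1 + beta * log(floor((pos + 1) / n_orig) + 1)
float attn_scale_old(int pos, int n_orig, float beta) {
    return 1.0f + beta * logf(floorf((float)(pos + 1) / n_orig) + 1.0f);
}

// Fixed scaling, matching the reference: 1 + beta * log(1 + floor(pos / n_orig))
float attn_scale_fixed(int pos, int n_orig, float beta) {
    return 1.0f + beta * logf(1.0f + floorf((float)pos / n_orig));
}
```

The two agree except when pos + 1 is an exact multiple of n_orig, where the old variant bumps the floor term one position too early.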

Impact

This caused divergence from the reference implementation at positions one less than a multiple of original_max_position_embeddings (i.e., whenever pos + 1 is an exact multiple of it). For example, with original_max_position_embeddings=8192:

| Position | Old Formula | Correct Formula |
|----------|-------------|-----------------|
| 8191     | 1.0693      | 1.0             |
| 8192     | 1.0693      | 1.0693          |
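
Working through the numbers, and assuming beta = 0.1 (the value implied by the table): at position 8191 the old formula gives 1 + 0.1 * log(floor(8192 / 8192) + 1) = 1 + 0.1 * log(2) ≈ 1.0693, while the correct formula gives 1 + 0.1 * log(1 + floor(8191 / 8192)) = 1 + 0.1 * log(1) = 1.0; at position 8192 both reduce to 1 + 0.1 * log(2) ≈ 1.0693. Calling the sketch functions above with n_orig = 8192 and beta = 0.1 reproduces these values.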

Fix

Changed the formula to match the PyTorch reference implementation exactly.

Related Issue

Fixes: ggml-org#17980 (Eval bug: Devstral diverges from reference implementation)

Fix off-by-one error in the attention temperature scaling formula used by
Mistral3 and Devstral models. The formula was computing:

  log(floor((pos + 1) / max_position_embeddings) + 1) * beta + 1

But the correct formula (matching the PyTorch reference implementation) is:

  1 + beta * log(1 + floor(pos / max_position_embeddings))

This caused divergence from the reference implementation at positions one
less than a multiple of original_max_position_embeddings (e.g., position 8191
for models with original_max_position_embeddings=8192).

Fixes: ggml-org#17980

Co-authored-by: openhands <[email protected]>
nlasky2000-dot marked this pull request as ready for review December 14, 2025 06:02
nlasky2000-dot merged commit 4cf8a56 into master Dec 14, 2025