@nlasky2000-dot (Owner)

Summary

This PR fixes an off-by-one error in the attention temperature scaling formula used by Mistral3 and Devstral models.

Problem

The attention temperature scaling formula in llama-graph.cpp was computing:

log(floor((pos + 1) / max_position_embeddings) + 1) * beta + 1

But the correct formula (matching the PyTorch reference implementation) is:

1 + beta * log(1 + floor(pos / max_position_embeddings))
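
The difference is only whether the position is incremented before the floor division. As a minimal standalone sketch of the two variants (illustrative helper names, not the actual llama-graph.cpp code):

```cpp
#include <cmath>

// Old (incorrect) scaling: 1 + beta * log(floor((pos + 1) / n_orig) + 1)
float attn_scale_old(int pos, int n_orig, float beta) {
    return 1.0f + beta * logf(floorf((float)(pos + 1) / n_orig) + 1.0f);
}

// Fixed scaling, matching the reference: 1 + beta * log(1 + floor(pos / n_orig))
float attn_scale_fixed(int pos, int n_orig, float beta) {
    return 1.0f + beta * logf(1.0f + floorf((float)pos / n_orig));
}
```

The two agree except when pos + 1 is an exact multiple of n_orig, where the old variant bumps the floor term one position too early.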

Impact

This caused divergence from the reference implementation at positions one less than a multiple of original_max_position_embeddings (i.e., whenever pos + 1 is an exact multiple of it). For example, with original_max_position_embeddings=8192:

| Position | Old Formula | Correct Formula |
|----------|-------------|-----------------|
| 8191     | 1.0693      | 1.0             |
| 8192     | 1.0693      | 1.0693          |
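
Working through the numbers, and assuming beta = 0.1 (the value implied by the table): at position 8191 the old formula gives 1 + 0.1 * log(floor(8192 / 8192) + 1) = 1 + 0.1 * log(2) ≈ 1.0693, while the correct formula gives 1 + 0.1 * log(1 + floor(8191 / 8192)) = 1 + 0.1 * log(1) = 1.0; at position 8192 both reduce to 1 + 0.1 * log(2) ≈ 1.0693. Calling the sketch functions above with n_orig = 8192 and beta = 0.1 reproduces these values.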

Fix

Changed the formula to match the PyTorch reference implementation exactly.

Related Issue

Fixes: ggml-org#17980 (Eval bug: Devstral diverges from reference implementation)

Fix off-by-one error in the attention temperature scaling formula used by
Mistral3 and Devstral models. The formula was computing:

  log(floor((pos + 1) / max_position_embeddings) + 1) * beta + 1

But the correct formula (matching the PyTorch reference implementation) is:

  1 + beta * log(1 + floor(pos / max_position_embeddings))

This caused divergence from the reference implementation at positions one
less than a multiple of original_max_position_embeddings (e.g., position 8191
for models with original_max_position_embeddings=8192).

Fixes: ggml-org#17980

Co-authored-by: openhands <[email protected]>
nlasky2000-dot marked this pull request as ready for review December 14, 2025 06:02
nlasky2000-dot merged commit 4cf8a56 into master Dec 14, 2025