
Eval bug: Devstral diverges from reference implementation #17980

@pwilkin

Description

Name and Version

version: 7327 (c8554b6)
built with GNU 15.2.1 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 5090

Models

Devstral-Small-2-24B-Instruct-2512

Problem description & steps to reproduce

When running causal-verify-logits, there is a sizeable divergence between the llama.cpp output and the PyTorch reference logits (a sketch of the metric computation follows the log below):

🔍 GGML Model Validation for model  Devstral-Small-2-24B-Instruct-2512
========================================
PyTorch logits  : data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin
llama.cpp logits: data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin

Top 10 PyTorch logits: [6.46875 5.8125  5.59375 5.40625 5.375   5.375   5.375   5.34375 5.28125
 5.21875]
Top 10 llama.cpp logits: [6.3750205 6.354723  5.9528265 5.5883093 5.538124  5.3347197 5.3297453
 5.3012514 5.2301598 5.226676 ]
Max absolute difference: 1.6652
✅ OK: Lightweight model check successful!
       Ok to proceed with NMSE check...
Model name: Devstral-Small-2-24B-Instruct-2512
PyTorch logits file: data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin
llama.cpp logits file: data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin
📊 NMSE Check for Model Comparison
==================================================
Reference (ground truth): data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin
Test (to evaluate):       data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin

Loading reference logits...
  Shape: (131072,), Type: float32
Loading test logits...
  Shape: (131072,), Type: float32

✅ Shapes match: (131072,)

📈 METRICS
==============================
MSE (Mean Squared Error):     3.468930e-02
Reference Variance:           2.693974e+00
NMSE:                         1.287663e-02
Max Absolute Error:           1.665206
Mean Absolute Error:          0.150697
NMSE (dB):                    -18.90 dB

🎯 INTERPRETATION
==============================
⚠ Acceptable match

📋 GUIDANCE
==============================
⚠  ACCEPTABLE: Conversion is working but with some differences.
   Check if you're using quantization (Q4, Q8, etc.)
   Test generation quality to see if it's acceptable.

📚 NMSE BENCHMARKS
==============================
< 1e-6:  Essentially identical
< 1e-4:  Excellent (typical for good conversions)
< 1e-3:  Very good
< 1e-2:  Good (acceptable for most use cases)
< 0.1:   Acceptable (may need verification)
> 1.0:   Poor (worse than random)

❌ RESULT: NEEDS REVIEW (NMSE = 1.29e-02)
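
For reference, the numbers above are consistent with NMSE = MSE / Var(reference) and NMSE(dB) = 10·log10(NMSE). A minimal sketch of the same comparison, assuming the .bin files are raw float32 logit dumps (the actual verification script may differ):

```python
import numpy as np

# Assumption: raw float32 dumps of the last-position logits,
# one value per vocab entry (shape (131072,) above).
ref = np.fromfile("data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin", dtype=np.float32)
test = np.fromfile("data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin", dtype=np.float32)

print("Top 10 PyTorch logits: ", np.sort(ref)[::-1][:10])
print("Top 10 llama.cpp logits:", np.sort(test)[::-1][:10])
print(f"Max absolute error: {np.abs(test - ref).max():.6f}")  # 1.665206 above

mse = np.mean((test - ref) ** 2)  # 3.468930e-02 above
nmse = mse / np.var(ref)          # ref variance 2.693974e+00 -> NMSE 1.287663e-02
print(f"NMSE = {nmse:.6e} ({10.0 * np.log10(nmse):.2f} dB)")  # -18.90 dB
```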

(Note: since this is a multimodal model and BOS is appended to the inputs, which Transformers does not seem to recognize, I had to modify the script a bit. Here's the version I used: https://gist.github.com/pwilkin/139dcf6fc2867ec1cae58982ff6ff1c7)
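
For illustration only (the linked gist contains the actual modification): the kind of adjustment involved is making sure the Transformers side sees the same BOS-prefixed token IDs that llama.cpp evaluates. The model path below is an assumption, and whether BOS must be added manually depends on the tokenizer config shipped with the checkpoint:

```python
from transformers import AutoTokenizer

# Hypothetical sketch; the model path is assumed, and the gist linked
# above has the actual script used for this report.
tok = AutoTokenizer.from_pretrained("Devstral-Small-2-24B-Instruct-2512")
prompt = "Hello, world!"

# Tokenize without special tokens, then prepend BOS manually so the
# reference run sees the same token IDs that llama.cpp evaluates.
ids = tok(prompt, add_special_tokens=False).input_ids
if tok.bos_token_id is not None and (not ids or ids[0] != tok.bos_token_id):
    ids = [tok.bos_token_id] + ids
```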

devstral-org.txt
devstral-conv.txt

First Bad Commit

No response

Relevant log output

Analyzing tensor dumps shows significant divergences already at the very first layer (a quick comparison of the reported sums follows the dumps below):


ggml_debug:                kqv_out-0 = (f32)    RESHAPE(__fattn__-0{128, 32, 6, 1}, }) = {4096, 6, 1, 1}
                                     [
                                      [
                                       [     -0.0021,       0.0016,       0.0007, ...,       0.0016,      -0.0010,      -0.0011],
                                       [     -0.0020,       0.0015,       0.0010, ...,       0.0028,       0.0192,       0.0009],
                                       [     -0.0021,       0.0008,       0.0014, ...,       0.0016,       0.0023,      -0.0007],
                                       [     -0.0020,       0.0008,       0.0005, ...,       0.0024,       0.0097,      -0.0008],
                                       [     -0.0020,       0.0011,       0.0007, ...,       0.0018,       0.0091,      -0.0023],
                                       [     -0.0018,       0.0008,       0.0008, ...,       0.0012,       0.0032,      -0.0017],
                                      ],
                                     ]
                                     sum = -9.084051
ggml_debug:               attn_out-0 = (f32)    MUL_MAT(blk.0.attn_output.weight{4096, 5120, 1, 1}, kqv_out-0{4096, 6, 1, 1}}) = {5120, 6, 1, 1}
                                     [
                                      [
                                       [     -0.0029,       0.0006,      -0.0006, ...,      -0.0003,       0.0017,       0.0005],
                                       [      0.0005,       0.0006,      -0.0001, ...,      -0.0009,       0.0004,      -0.0015],
                                       [     -0.0014,       0.0012,       0.0007, ...,      -0.0003,      -0.0019,      -0.0013],
                                       [      0.0011,       0.0014,      -0.0002, ...,      -0.0005,       0.0007,      -0.0014],
                                       [      0.0009,       0.0015,      -0.0012, ...,       0.0002,       0.0006,      -0.0021],
                                       [     -0.0017,       0.0002,      -0.0022, ...,       0.0010,       0.0001,      -0.0009],
                                      ],
                                     ]
                                     sum = -0.582590


vs


ggml_debug: model.language_model.layers.0.self_attn.o_proj_in = (f32)  ... = {torch.Size([1, 6, 4096])}
                                     [
                                      [
                                       [     -0.0017,       0.0014,       0.0008, ...,       0.0014,      -0.0008,      -0.0011]
                                       [     -0.0017,       0.0013,       0.0010, ...,       0.0027,       0.0197,       0.0006]
                                       [     -0.0018,       0.0005,       0.0014, ...,       0.0014,       0.0024,      -0.0007]
                                       [     -0.0017,       0.0005,       0.0005, ...,       0.0023,       0.0096,      -0.0009]
                                       [     -0.0017,       0.0008,       0.0007, ...,       0.0016,       0.0092,      -0.0023]
                                       [     -0.0016,       0.0006,       0.0008, ...,       0.0011,       0.0033,      -0.0017]
                                      ],
                                     ]
                                     sum = -9.691846

ggml_debug: model.language_model.layers.0.self_attn.o_proj_out = (f32)  ... = {torch.Size([1, 6, 5120])}
                                     [
                                      [
                                       [     -0.0029,       0.0006,      -0.0007, ...,      -0.0003,       0.0018,       0.0006]
                                       [      0.0003,       0.0006,      -0.0002, ...,      -0.0008,       0.0005,      -0.0014]
                                       [     -0.0016,       0.0012,       0.0006, ...,      -0.0003,      -0.0017,      -0.0012]
                                       [      0.0010,       0.0014,      -0.0003, ...,      -0.0004,       0.0008,      -0.0014]
                                       [      0.0009,       0.0014,      -0.0014, ...,       0.0002,       0.0006,      -0.0021]
                                       [     -0.0017,       0.0001,      -0.0022, ...,       0.0010,       0.0001,      -0.0008]
                                      ],
                                     ]
                                     sum = -0.603682
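
Element sums are only a coarse signal, but even they differ noticeably at layer 0. A quick check on the sums reported above:

```python
# Sums reported in the dumps above (llama.cpp value first, PyTorch reference second).
sums = {
    "attention output (kqv_out-0 vs o_proj_in)":  (-9.084051, -9.691846),
    "o_proj output   (attn_out-0 vs o_proj_out)": (-0.582590, -0.603682),
}
for name, (lcpp, ref) in sums.items():
    rel = abs(lcpp - ref) / abs(ref)
    print(f"{name}: {rel:.1%} relative difference")
```

This prints roughly 6.3% and 3.5% relative differences, already well above float32 rounding noise.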

Labels

need feedback: Testing and feedback with results are needed
