
Eval bug: Devstral diverges from reference implementation #17980

@pwilkin

Description

Name and Version

version: 7327 (c8554b6)
built with GNU 15.2.1 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 5090

Models

Devstral-Small-2-24B-Instruct-2512

Problem description & steps to reproduce

When running causal-verify-logits, there is a sizeable divergence between the llama.cpp output and the PyTorch reference logits (a sketch of the metric computation follows the log below):

🔍 GGML Model Validation for model  Devstral-Small-2-24B-Instruct-2512
========================================
PyTorch logits  : data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin
llama.cpp logits: data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin

Top 10 PyTorch logits: [6.46875 5.8125  5.59375 5.40625 5.375   5.375   5.375   5.34375 5.28125
 5.21875]
Top 10 llama.cpp logits: [6.3750205 6.354723  5.9528265 5.5883093 5.538124  5.3347197 5.3297453
 5.3012514 5.2301598 5.226676 ]
Max absolute difference: 1.6652
✅ OK: Lightweight model check successful!
       Ok to proceed with NMSE check...
Model name: Devstral-Small-2-24B-Instruct-2512
PyTorch logits file: data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin
llama.cpp logits file: data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin
📊 NMSE Check for Model Comparison
==================================================
Reference (ground truth): data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin
Test (to evaluate):       data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin

Loading reference logits...
  Shape: (131072,), Type: float32
Loading test logits...
  Shape: (131072,), Type: float32

✅ Shapes match: (131072,)

📈 METRICS
==============================
MSE (Mean Squared Error):     3.468930e-02
Reference Variance:           2.693974e+00
NMSE:                         1.287663e-02
Max Absolute Error:           1.665206
Mean Absolute Error:          0.150697
NMSE (dB):                    -18.90 dB

🎯 INTERPRETATION
==============================
⚠ Acceptable match

📋 GUIDANCE
==============================
⚠  ACCEPTABLE: Conversion is working but with some differences.
   Check if you're using quantization (Q4, Q8, etc.)
   Test generation quality to see if it's acceptable.

📚 NMSE BENCHMARKS
==============================
< 1e-6:  Essentially identical
< 1e-4:  Excellent (typical for good conversions)
< 1e-3:  Very good
< 1e-2:  Good (acceptable for most use cases)
< 0.1:   Acceptable (may need verification)
> 1.0:   Poor (worse than random)

❌ RESULT: NEEDS REVIEW (NMSE = 1.29e-02)
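
For reference, the numbers above are consistent with NMSE = MSE / Var(reference) and NMSE(dB) = 10·log10(NMSE). A minimal sketch of the same comparison, assuming the .bin files are raw float32 logit dumps (the actual verification script may differ):

```python
import numpy as np

# Assumption: raw float32 dumps of the last-position logits,
# one value per vocab entry (shape (131072,) above).
ref = np.fromfile("data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin", dtype=np.float32)
test = np.fromfile("data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin", dtype=np.float32)

print("Top 10 PyTorch logits: ", np.sort(ref)[::-1][:10])
print("Top 10 llama.cpp logits:", np.sort(test)[::-1][:10])
print(f"Max absolute error: {np.abs(test - ref).max():.6f}")  # 1.665206 above

mse = np.mean((test - ref) ** 2)  # 3.468930e-02 above
nmse = mse / np.var(ref)          # ref variance 2.693974e+00 -> NMSE 1.287663e-02
print(f"NMSE = {nmse:.6e} ({10.0 * np.log10(nmse):.2f} dB)")  # -18.90 dB
```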

(Note: since this is a multimodal model and BOS is appended to the inputs, which Transformers does not seem to recognize, I had to modify the script a bit. Here's the version I used: https://gist.github.com/pwilkin/139dcf6fc2867ec1cae58982ff6ff1c7)
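
For illustration only (the linked gist contains the actual modification): the kind of adjustment involved is making sure the Transformers side sees the same BOS-prefixed token IDs that llama.cpp evaluates. The model path below is an assumption, and whether BOS must be added manually depends on the tokenizer config shipped with the checkpoint:

```python
from transformers import AutoTokenizer

# Hypothetical sketch; the model path is assumed, and the gist linked
# above has the actual script used for this report.
tok = AutoTokenizer.from_pretrained("Devstral-Small-2-24B-Instruct-2512")
prompt = "Hello, world!"

# Tokenize without special tokens, then prepend BOS manually so the
# reference run sees the same token IDs that llama.cpp evaluates.
ids = tok(prompt, add_special_tokens=False).input_ids
if tok.bos_token_id is not None and (not ids or ids[0] != tok.bos_token_id):
    ids = [tok.bos_token_id] + ids
```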

devstral-org.txt
devstral-conv.txt

First Bad Commit

No response

Relevant log output

Analyzing tensor dumps shows significant divergences already at the very first layer (a quick comparison of the reported sums follows the dumps below):


ggml_debug:                kqv_out-0 = (f32)    RESHAPE(__fattn__-0{128, 32, 6, 1}, }) = {4096, 6, 1, 1}
                                     [
                                      [
                                       [     -0.0021,       0.0016,       0.0007, ...,       0.0016,      -0.0010,      -0.0011],
                                       [     -0.0020,       0.0015,       0.0010, ...,       0.0028,       0.0192,       0.0009],
                                       [     -0.0021,       0.0008,       0.0014, ...,       0.0016,       0.0023,      -0.0007],
                                       [     -0.0020,       0.0008,       0.0005, ...,       0.0024,       0.0097,      -0.0008],
                                       [     -0.0020,       0.0011,       0.0007, ...,       0.0018,       0.0091,      -0.0023],
                                       [     -0.0018,       0.0008,       0.0008, ...,       0.0012,       0.0032,      -0.0017],
                                      ],
                                     ]
                                     sum = -9.084051
ggml_debug:               attn_out-0 = (f32)    MUL_MAT(blk.0.attn_output.weight{4096, 5120, 1, 1}, kqv_out-0{4096, 6, 1, 1}}) = {5120, 6, 1, 1}
                                     [
                                      [
                                       [     -0.0029,       0.0006,      -0.0006, ...,      -0.0003,       0.0017,       0.0005],
                                       [      0.0005,       0.0006,      -0.0001, ...,      -0.0009,       0.0004,      -0.0015],
                                       [     -0.0014,       0.0012,       0.0007, ...,      -0.0003,      -0.0019,      -0.0013],
                                       [      0.0011,       0.0014,      -0.0002, ...,      -0.0005,       0.0007,      -0.0014],
                                       [      0.0009,       0.0015,      -0.0012, ...,       0.0002,       0.0006,      -0.0021],
                                       [     -0.0017,       0.0002,      -0.0022, ...,       0.0010,       0.0001,      -0.0009],
                                      ],
                                     ]
                                     sum = -0.582590


vs


ggml_debug: model.language_model.layers.0.self_attn.o_proj_in = (f32)  ... = {torch.Size([1, 6, 4096])}
                                     [
                                      [
                                       [     -0.0017,       0.0014,       0.0008, ...,       0.0014,      -0.0008,      -0.0011]
                                       [     -0.0017,       0.0013,       0.0010, ...,       0.0027,       0.0197,       0.0006]
                                       [     -0.0018,       0.0005,       0.0014, ...,       0.0014,       0.0024,      -0.0007]
                                       [     -0.0017,       0.0005,       0.0005, ...,       0.0023,       0.0096,      -0.0009]
                                       [     -0.0017,       0.0008,       0.0007, ...,       0.0016,       0.0092,      -0.0023]
                                       [     -0.0016,       0.0006,       0.0008, ...,       0.0011,       0.0033,      -0.0017]
                                      ],
                                     ]
                                     sum = -9.691846

ggml_debug: model.language_model.layers.0.self_attn.o_proj_out = (f32)  ... = {torch.Size([1, 6, 5120])}
                                     [
                                      [
                                       [     -0.0029,       0.0006,      -0.0007, ...,      -0.0003,       0.0018,       0.0006]
                                       [      0.0003,       0.0006,      -0.0002, ...,      -0.0008,       0.0005,      -0.0014]
                                       [     -0.0016,       0.0012,       0.0006, ...,      -0.0003,      -0.0017,      -0.0012]
                                       [      0.0010,       0.0014,      -0.0003, ...,      -0.0004,       0.0008,      -0.0014]
                                       [      0.0009,       0.0014,      -0.0014, ...,       0.0002,       0.0006,      -0.0021]
                                       [     -0.0017,       0.0001,      -0.0022, ...,       0.0010,       0.0001,      -0.0008]
                                      ],
                                     ]
                                     sum = -0.603682
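
Element sums are only a coarse signal, but even they differ noticeably at layer 0. A quick check on the sums reported above:

```python
# Sums reported in the dumps above (llama.cpp value first, PyTorch reference second).
sums = {
    "attention output (kqv_out-0 vs o_proj_in)":  (-9.084051, -9.691846),
    "o_proj output   (attn_out-0 vs o_proj_out)": (-0.582590, -0.603682),
}
for name, (lcpp, ref) in sums.items():
    rel = abs(lcpp - ref) / abs(ref)
    print(f"{name}: {rel:.1%} relative difference")
```

This prints roughly 6.3% and 3.5% relative differences, already well above float32 rounding noise.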

Labels

need feedback: Testing and feedback with results are needed
