Name and Version
version: 7327 (c8554b6)
built with GNU 15.2.1 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 5090
Models
Devstral-Small-2-24B-Instruct-2512
Problem description & steps to reproduce
When running causal-verify-logits, there is a sizeable divergence between the PyTorch and llama.cpp logits:
🔍 GGML Model Validation for model Devstral-Small-2-24B-Instruct-2512
========================================
PyTorch logits : data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin
llama.cpp logits: data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin
Top 10 PyTorch logits: [6.46875 5.8125 5.59375 5.40625 5.375 5.375 5.375 5.34375 5.28125
5.21875]
Top 10 llama.cpp logits: [6.3750205 6.354723 5.9528265 5.5883093 5.538124 5.3347197 5.3297453
5.3012514 5.2301598 5.226676 ]
Max absolute difference: 1.6652
✅ OK: Lightweight model check successful!
Ok to proceed with NMSE check...
Model name: Devstral-Small-2-24B-Instruct-2512
PyTorch logits file: data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin
llama.cpp logits file: data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin
📊 NMSE Check for Model Comparison
==================================================
Reference (ground truth): data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin
Test (to evaluate): data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin
Loading reference logits...
Shape: (131072,), Type: float32
Loading test logits...
Shape: (131072,), Type: float32
✅ Shapes match: (131072,)
📈 METRICS
==============================
MSE (Mean Squared Error): 3.468930e-02
Reference Variance: 2.693974e+00
NMSE: 1.287663e-02
Max Absolute Error: 1.665206
Mean Absolute Error: 0.150697
NMSE (dB): -18.90 dB
🎯 INTERPRETATION
==============================
⚠ Acceptable match
📋 GUIDANCE
==============================
⚠ ACCEPTABLE: Conversion is working but with some differences.
Check if you're using quantization (Q4, Q8, etc.)
Test generation quality to see if it's acceptable.
📚 NMSE BENCHMARKS
==============================
< 1e-6: Essentially identical
< 1e-4: Excellent (typical for good conversions)
< 1e-3: Very good
< 1e-2: Good (acceptable for most use cases)
< 0.1: Acceptable (may need verification)
> 1.0: Poor (worse than random)
❌ RESULT: NEEDS REVIEW (NMSE = 1.29e-02)
(Note: since this is a multimodal model AND one that appends BOS to the inputs (which Transformers does not seem to handle), I had to modify the script a bit; here's the version I used: https://gist.github.com/pwilkin/139dcf6fc2867ec1cae58982ff6ff1c7)
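For anyone reproducing the numbers above, here's a minimal sketch of the NMSE math the check reports, assuming the .bin files are raw float32 logit dumps (which the (131072,) shapes suggest):

import numpy as np

# Reference (PyTorch) and test (llama.cpp) logits from this report.
ref = np.fromfile("data/pytorch-Devstral-Small-2-24B-Instruct-2512.bin", dtype=np.float32)
test = np.fromfile("data/llamacpp-Devstral-Small-2-24B-Instruct-2512.bin", dtype=np.float32)
assert ref.shape == test.shape  # both (131072,) here

mse = np.mean((ref - test) ** 2)  # 3.468930e-02 in this run
nmse = mse / np.var(ref)          # normalized by the reference variance (2.693974e+00)
print(f"NMSE = {nmse:.6e} ({10 * np.log10(nmse):.2f} dB)")  # 1.287663e-02, -18.90 dB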
devstral-org.txt
devstral-conv.txt
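Regarding the BOS note above, a hedged sketch of the alignment needed on the Transformers side (model path and prompt are placeholders, not the exact script from the gist):

from transformers import AutoTokenizer

# Placeholder path; the report's model is a local conversion of
# Devstral-Small-2-24B-Instruct-2512.
tok = AutoTokenizer.from_pretrained("path/to/Devstral-Small-2-24B-Instruct-2512")
ids = tok("Hello", add_special_tokens=False).input_ids
# llama.cpp prepends BOS for this model; do the same on the HF side so
# both runs score logits at the same token positions.
if tok.bos_token_id is not None and ids[:1] != [tok.bos_token_id]:
    ids = [tok.bos_token_id] + ids
print(ids)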
First Bad Commit
No response
Relevant log output
On the very first layer, the tensor dumps already show significant divergences:
ggml_debug: kqv_out-0 = (f32) RESHAPE(__fattn__-0{128, 32, 6, 1}, }) = {4096, 6, 1, 1}
[
[
[ -0.0021, 0.0016, 0.0007, ..., 0.0016, -0.0010, -0.0011],
[ -0.0020, 0.0015, 0.0010, ..., 0.0028, 0.0192, 0.0009],
[ -0.0021, 0.0008, 0.0014, ..., 0.0016, 0.0023, -0.0007],
[ -0.0020, 0.0008, 0.0005, ..., 0.0024, 0.0097, -0.0008],
[ -0.0020, 0.0011, 0.0007, ..., 0.0018, 0.0091, -0.0023],
[ -0.0018, 0.0008, 0.0008, ..., 0.0012, 0.0032, -0.0017],
],
]
sum = -9.084051
ggml_debug: attn_out-0 = (f32) MUL_MAT(blk.0.attn_output.weight{4096, 5120, 1, 1}, kqv_out-0{4096, 6, 1, 1}}) = {5120, 6, 1, 1}
[
[
[ -0.0029, 0.0006, -0.0006, ..., -0.0003, 0.0017, 0.0005],
[ 0.0005, 0.0006, -0.0001, ..., -0.0009, 0.0004, -0.0015],
[ -0.0014, 0.0012, 0.0007, ..., -0.0003, -0.0019, -0.0013],
[ 0.0011, 0.0014, -0.0002, ..., -0.0005, 0.0007, -0.0014],
[ 0.0009, 0.0015, -0.0012, ..., 0.0002, 0.0006, -0.0021],
[ -0.0017, 0.0002, -0.0022, ..., 0.0010, 0.0001, -0.0009],
],
]
sum = -0.582590
vs
ggml_debug: model.language_model.layers.0.self_attn.o_proj_in = (f32) ... = {torch.Size([1, 6, 4096])}
[
[
[ -0.0017, 0.0014, 0.0008, ..., 0.0014, -0.0008, -0.0011]
[ -0.0017, 0.0013, 0.0010, ..., 0.0027, 0.0197, 0.0006]
[ -0.0018, 0.0005, 0.0014, ..., 0.0014, 0.0024, -0.0007]
[ -0.0017, 0.0005, 0.0005, ..., 0.0023, 0.0096, -0.0009]
[ -0.0017, 0.0008, 0.0007, ..., 0.0016, 0.0092, -0.0023]
[ -0.0016, 0.0006, 0.0008, ..., 0.0011, 0.0033, -0.0017]
],
]
sum = -9.691846
ggml_debug: model.language_model.layers.0.self_attn.o_proj_out = (f32) ... = {torch.Size([1, 6, 5120])}
[
[
[ -0.0029, 0.0006, -0.0007, ..., -0.0003, 0.0018, 0.0006]
[ 0.0003, 0.0006, -0.0002, ..., -0.0008, 0.0005, -0.0014]
[ -0.0016, 0.0012, 0.0006, ..., -0.0003, -0.0017, -0.0012]
[ 0.0010, 0.0014, -0.0003, ..., -0.0004, 0.0008, -0.0014]
[ 0.0009, 0.0014, -0.0014, ..., 0.0002, 0.0006, -0.0021]
[ -0.0017, 0.0001, -0.0022, ..., 0.0010, 0.0001, -0.0008]
],
]
sum = -0.603682
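To quantify dumps like these rather than eyeballing truncated rows, a rough sketch (the .npy file names are hypothetical; ggml_debug prints text, so the activations would first need to be captured as arrays):

import numpy as np

# ggml shape {4096, 6, 1, 1} and torch.Size([1, 6, 4096]) both flatten
# to 6 rows of 4096 values.
a = np.load("kqv_out-0.npy").reshape(6, 4096)   # llama.cpp side
b = np.load("o_proj_in.npy").reshape(6, 4096)   # Transformers side
diff = np.abs(a - b)
print(f"max |a-b| = {diff.max():.4f}, mean |a-b| = {diff.mean():.6f}")
print(f"sums: {a.sum():.6f} vs {b.sum():.6f}")  # -9.084051 vs -9.691846 above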