mtmd : fix idefics3 preprocessing by ngxson · Pull Request #16806 · ggml-org/llama.cpp

ngxson · 2025-10-27T15:39:07Z

Fix a bug spotted in #16776

Also fix the resize of overview_image, which should resolve the issue discovered in a comment on #16718

The test for granite-docling was initially disabled because the model seems to output structured data instead of plain text, something like: <doctag><text><loc_39><loc_112><loc_90><loc_126>VOL.CXVIII.No.40,721</text>

Update: the test is fixed and now checks for the presence of the words "men" and "walk".
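The keyword check described above can be sketched in C++ as follows. This is only an illustration of the idea: the project's real test may be implemented differently, and the helper names `to_lower_ascii` and `contains_all`, as well as the case-insensitive matching, are assumptions rather than anything taken from the PR.

```cpp
#include <algorithm>
#include <cctype>
#include <initializer_list>
#include <string>

// Hypothetical helper: lowercase an ASCII string so matching ignores case.
inline std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: true only if every keyword occurs somewhere in `output`.
// Substring matching means surrounding markup such as <text>...</text> is ignored.
inline bool contains_all(const std::string & output,
                         std::initializer_list<const char *> words) {
    const std::string haystack = to_lower_ascii(output);
    for (const char * w : words) {
        if (haystack.find(to_lower_ascii(w)) == std::string::npos) {
            return false;
        }
    }
    return true;
}
```

Calling `contains_all(output, {"men", "walk"})` passes for both plain-text and tag-wrapped output, as long as both words appear somewhere in it.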

Test result:

[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/LFM2-VL-450M-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/granite-docling-258M-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M

ngxson · 2025-10-27T15:48:25Z

        // resize to overview size
        clip_image_u8_ptr resized_img(clip_image_u8_init());
-        image_manipulation::bicubic_resize(*img, *resized_img, inst.overview_size.width, inst.overview_size.height);
+        image_manipulation::resize_and_pad_image(*img, *resized_img, inst.overview_size);


Also cc @gabe-l-hart for visibility: before this change, slice_image resized the overview image without padding (and without preserving the aspect ratio). It should be fixed now.
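To make the difference concrete, here is a minimal sketch of the geometry behind an aspect-preserving resize with padding, compared with a plain stretch to the target size. The types `size2d` and `pad_layout` and the functions `fit_with_padding` and `stretch_distortion` are hypothetical names for illustration. They are not the structs or helpers in clip.cpp, and centering the padded image is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical types for illustration only.
struct size2d {
    int width;
    int height;
};

struct pad_layout {
    size2d scaled;   // image size after the aspect-preserving resize
    int    offset_x; // padding on the left (assumes centered placement)
    int    offset_y; // padding on the top  (assumes centered placement)
};

// Fit `src` inside `target` without changing its aspect ratio,
// then compute the offsets that center it in the padded canvas.
inline pad_layout fit_with_padding(size2d src, size2d target) {
    assert(src.width > 0 && src.height > 0);
    assert(target.width > 0 && target.height > 0);

    // Use the smaller of the two scale factors so both sides fit.
    const double scale = std::min(
        static_cast<double>(target.width)  / src.width,
        static_cast<double>(target.height) / src.height);

    size2d scaled;
    scaled.width  = std::max(1, std::min(target.width,
                        static_cast<int>(std::lround(src.width  * scale))));
    scaled.height = std::max(1, std::min(target.height,
                        static_cast<int>(std::lround(src.height * scale))));

    return { scaled,
             (target.width  - scaled.width)  / 2,
             (target.height - scaled.height) / 2 };
}

// Factor by which a plain stretch to `target` changes the aspect ratio.
// 1.0 means no distortion.
inline double stretch_distortion(size2d src, size2d target) {
    const double src_ratio = static_cast<double>(src.width)    / src.height;
    const double dst_ratio = static_cast<double>(target.width) / target.height;
    return dst_ratio / src_ratio;
}
```

For a 1000x500 input and a 512x512 target, `fit_with_padding` yields a 512x256 image with 128 rows of padding above it, whereas a plain stretch halves the aspect ratio (`stretch_distortion` returns 0.5).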

gabe-l-hart · 2025-10-27T18:00:34Z

I tested this with granite-docling-258m at BF16 on the image from #16678 and I see slightly different results. On the one hand, the result is cleaner (less repetition), but on the other hand, it misses some of the text. I think overall this is an improvement, especially if this is more faithful to the preprocessing from transformers.

ngxson added 3 commits October 27, 2025 16:27

mtmd : fix idefics3 preprocessing

6ea7d9a

disable granite test

029735f

fix test for granite

d07533e

ngxson requested a review from ggerganov October 27, 2025 15:46

ngxson commented Oct 27, 2025

github-actions Bot added the examples label Oct 27, 2025

ggerganov approved these changes Oct 27, 2025

gabe-l-hart mentioned this pull request Oct 27, 2025

Eval bug: IBM Granite Docling goes in loop. #16678

Closed

ngxson merged commit e1ab084 into ggml-org:master Oct 27, 2025
71 of 72 checks passed

ngxson mentioned this pull request Oct 27, 2025

Resize and tile split discrepancy between HF transformers and LlamaCpp for SmolVLM2 #16776

Closed

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026

mtmd : fix idefics3 preprocessing (ggml-org#16806)

4decf43

* mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

mtmd : fix idefics3 preprocessing (#16806)

eb466bc

* mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026

mtmd : fix idefics3 preprocessing (ggml-org#16806)

032d4e1

* mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite

ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026

mtmd : fix idefics3 preprocessing (ggml-org#16806)

a0a548a

* mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite

my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026

mtmd : fix idefics3 preprocessing (ggml-org#16806)

a0e5291

* mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite

my-other-github-account pushed a commit to my-other-github-account/llama.cpp that referenced this pull request May 15, 2026

mtmd : fix idefics3 preprocessing (ggml-org#16806)

257130e

* mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite

phibya pushed a commit to ziee-ai/llama.cpp that referenced this pull request May 29, 2026

mtmd : fix idefics3 preprocessing (ggml-org#16806)

0129dc6

* mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

mtmd : fix idefics3 preprocessing (ggml-org#16806)

00b8aa0

* mtmd : fix idefics3 preprocessing * disable granite test * fix test for granite

ngxson merged 3 commits into ggml-org:master from ngxson:xsn/idefics3-fix-preproc
