Conversation

@yonigozlan yonigozlan commented Apr 4, 2025

Simplify the handling of images in both processing and modeling.

Now the images/patches are flattened before being processed and passed to the models. This simplifies the image processing (no padding is needed along the number-of-images/patches dimension) as well as the modeling code (no more padding images/patches containing only 0/False that need to be removed).
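
For illustration, here is a rough sketch of the shape change described above (tensor names and shapes are illustrative, not copied from the PR):

import torch

# Before: pixel_values padded along the patches dimension, plus a mask used to drop the padding later.
# Two samples with 3 and 2 real patches respectively.
padded_pixel_values = torch.zeros(2, 3, 3, 384, 384)  # (batch, max_num_patches, C, H, W)
pixel_attention_mask = torch.zeros(2, 3, dtype=torch.bool)
pixel_attention_mask[0, :3] = True  # sample 0: 3 real patches
pixel_attention_mask[1, :2] = True  # sample 1: 2 real patches + 1 all-zero pad

# After this PR: the real patches of all samples are concatenated along dim 0,
# so no padding patches are created and none have to be filtered out in the model.
flattened_pixel_values = torch.zeros(5, 3, 384, 384)  # (total_num_patches, C, H, W)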

I tested each model thoroughly with multiple images, batched images, etc., and found no differences.

Cc @andimarafioti @orrzohar

@github-actions github-actions bot marked this pull request as draft April 4, 2025 17:19

github-actions bot commented Apr 4, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yonigozlan yonigozlan marked this pull request as ready for review April 4, 2025 18:09
@github-actions github-actions bot requested review from ArthurZucker and qubvel April 4, 2025 18:09
@yonigozlan yonigozlan requested a review from zucchini-nlp April 7, 2025 23:54

yonigozlan commented Apr 7, 2025

@zucchini-nlp Hello! Pinging you here as smolvlm also handles video inputs, and I'm wondering what you think about having flattened pixel_values by default when processing videos, instead of grouping them by frame or video instance. Also, since most image (and maybe video?) processors for VLMs that use some kind of patching/splitting flatten the patches during preprocessing, we might want to update the base processing tests to account for that, or at least make them parameterized.


@zucchini-nlp zucchini-nlp left a comment


Cool, thanks for cleaning it! I have a few concerns though.

  1. After this PR, idefics models will output pixel_values whose first dim is not necessarily the batch size whenever image splitting happens. We had problems in the past with Gemma3 (huggingface/trl#3121 (comment)) and Qwen2-VL (#33666) for the same reason. TL;DR: train loaders/frameworks iterate over data assuming the first dim is the batch dimension and fail when it is not (see the sketch after this list).
    I realize this is not a common case, but we might be breaking training for some users with this, so I'm a bit hesitant to return flat images. LMK what you think about it.
  2. Do the model logits stay the same if we test with several batches and several images per batch? Let's run slow tests before merging :)
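
To make the first concern concrete, here is a minimal sketch (toy shapes, not taken from the PR) of how a loader that assumes dim 0 is the batch dimension can misalign flattened patches:

import torch

# With image splitting, 2 samples can yield e.g. 5 patches in total,
# so pixel_values.shape[0] (5) no longer matches the batch size (2).
batch_size = 2
input_ids = torch.randint(0, 100, (batch_size, 16))
pixel_values = torch.randn(5, 3, 384, 384)  # patches flattened across the whole batch

batch = {"input_ids": input_ids, "pixel_values": pixel_values}

# A train loader/framework that slices every tensor along dim 0 per sample misaligns:
for i in range(batch_size):
    sample = {k: v[i] for k, v in batch.items()}
    # sample["pixel_values"] is a single patch here, not the set of patches belonging to sample i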

elif inputs_embeds is not None:
    batch_size, seq_length, _ = inputs_embeds.shape
else:
if input_ids is None and inputs_embeds is None:

Need to also check the case where both are not None:

if (input_ids is None) ^ (inputs_embeds is not None):
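
Expanded into a small self-contained sketch (the wrapper function name is just for illustration), the suggested exclusive-or guard rejects both the "neither provided" and the "both provided" cases:

def validate_text_inputs(input_ids=None, inputs_embeds=None):
    # Exactly one of input_ids / inputs_embeds must be given.
    if (input_ids is None) ^ (inputs_embeds is not None):
        raise ValueError("You must specify exactly one of input_ids or inputs_embeds")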

elif inputs_embeds is not None:
    batch_size, seq_length, _ = inputs_embeds.shape
else:
if input_ids is None and inputs_embeds is None:

same here


@ArthurZucker ArthurZucker left a comment


thanks for working on this!

input_data_format = infer_channel_dimension_format(images_list[0][0])
input_data_format = infer_channel_dimension_format(images[0])

if do_image_splitting:

why don't we try to do a single for loop while we are at it?
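
A rough standalone sketch (NumPy, with hypothetical helper and argument names rather than the processor's real API) of what a single pass over the flat image list could look like:

import numpy as np

def split_into_patches(image, patch):
    # Naive grid split of an HWC image into patch x patch tiles (illustrative only).
    h, w, _ = image.shape
    return [image[i:i + patch, j:j + patch] for i in range(0, h, patch) for j in range(0, w, patch)]

def preprocess(images, do_image_splitting, patch=384):
    # One loop: optionally split each image and collect everything into a flat list of patches.
    processed = []
    for image in images:  # `images` is already a flat list after this PR
        if do_image_splitting:
            processed.extend(split_into_patches(image, patch))
        else:
            processed.append(image)
    return processed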


pcuenca commented Jun 24, 2025

Just for info, there's a bug in video processing with smolvlm2 where the list of frames is malformed after calling processor.apply_chat_template, which prevents generation from working properly. This started happening in transformers 4.52.1 (or maybe 4.52.0, which was yanked). This PR works fine.

I saw it reported here, but it's a transformers issue, not MLX's.
