
Conversation

@yonigozlan (Member) commented Jul 22, 2025

What does this PR do?

As discussed internally, this PR starts the process of making fast image processors the default in 🤗Transformers!

When instantiating a Processor or an Image Processor via AutoProcessor.from_pretrained or AutoImageProcessor.from_pretrained with a checkpoint that uses Qwen2VLImageProcessor, the behavior is now to load Qwen2VLImageProcessorFast, even if the processor was originally saved with a slow Qwen2VLImageProcessor.

For instance:
Old behavior:

```python
>>> from transformers import AutoProcessor
>>> processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
>>> print(type(processor.image_processor))
<class 'transformers.models.qwen2_vl.image_processing_qwen2_vl.Qwen2VLImageProcessor'>
```

New behavior:

```python
>>> from transformers import AutoProcessor
>>> processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
>>> print(type(processor.image_processor))
"""The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release."""
<class 'transformers.models.qwen2_vl.image_processing_qwen2_vl_fast.Qwen2VLImageProcessorFast'>
```

(The warning is emitted with `warning_once`, so it only appears once per session.)

This PR also comes with a long-overdue refactor, which should be 100% compatible with the slow image processor of Qwen2-VL and fixes some existing inconsistencies in the fast one. Cc @zucchini-nlp for that :)

🚨 The processed image outputs of the slow and fast image processors are slightly different! This is expected, as the torchvision and PIL image processing functions are not fully equivalent.
Users can still force the use of the slow processor by loading it with use_fast=False:

```python
>>> from transformers import AutoProcessor
>>> processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", use_fast=False)
>>> print(type(processor.image_processor))
<class 'transformers.models.qwen2_vl.image_processing_qwen2_vl.Qwen2VLImageProcessor'>
```
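To quantify the fast/slow difference on your own inputs, here is a minimal sketch (the local image path is a placeholder; the checkpoint is the one used above):

```python
from PIL import Image
from transformers import AutoImageProcessor

image = Image.open("example.jpg")  # placeholder: any local test image

# The default load now returns the fast processor; use_fast=False forces the slow one.
fast = AutoImageProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
slow = AutoImageProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", use_fast=False)

pixels_fast = fast(images=image, return_tensors="pt")["pixel_values"]
pixels_slow = slow(images=image, return_tensors="pt")["pixel_values"]

diff = (pixels_fast - pixels_slow).abs()
print(f"max diff: {diff.max().item():.3e}, mean diff: {diff.mean().item():.3e}")
```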

Here are some comparisons between the fast and slow processors with this refactor.
"mixed_various" means the input contains images of different sizes. The image used for these benchmarks is this one.

Summary of the summary: up to a 30x speedup, with average output pixel differences between 5e-8 and 3e-3 depending on the processing parameters and input image sizes.

Summary: Max Output Difference vs. Slow processor

This table shows the maximum difference at any single point between the output tensors of the Fast processors and the Slow processor.

| Batch Size | 1 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| (mixed_various, Fast_cpu_grouping_disabled) | 2.384e-07 | 0.0292 | 0.0292 | 0.0292 | 0.0292 | 0.0292 |
| (mixed_various, Fast_cpu_grouping_enabled) | 2.384e-07 | 0.0292 | 0.0292 | 0.0292 | 0.0292 | 0.0292 |
| (uniform_1024x1024, Fast_cpu_grouping_disabled) | 0.0292 | 0.0292 | 0.0292 | 0.0292 | 0.0292 | 0.0292 |
| (uniform_1024x1024, Fast_cpu_grouping_enabled) | 0.0292 | 0.0292 | 0.0292 | 0.0292 | 0.0292 | 0.0292 |
| (uniform_224x224, Fast_cpu_grouping_disabled) | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 |
| (uniform_224x224, Fast_cpu_grouping_enabled) | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 |
| (uniform_512x512, Fast_cpu_grouping_disabled) | 0.01501 | 0.01501 | 0.01501 | 0.01501 | 0.01501 | 0.01501 |
| (uniform_512x512, Fast_cpu_grouping_enabled) | 0.01501 | 0.01501 | 0.01501 | 0.01501 | 0.01501 | 0.01501 |
| (mixed_various, Fast_cuda_grouping_disabled) | 2.384e-07 | 0.09005 | 0.09005 | 0.09005 | 0.09005 | 0.09005 |
| (mixed_various, Fast_cuda_grouping_enabled) | 2.384e-07 | 0.09005 | 0.09005 | 0.09005 | 0.09005 | 0.09005 |
| (uniform_1024x1024, Fast_cuda_grouping_disabled) | 0.04266 | 0.04266 | 0.04266 | 0.04266 | 0.04266 | 0.04266 |
| (uniform_1024x1024, Fast_cuda_grouping_enabled) | 0.04266 | 0.04266 | 0.04266 | 0.04266 | 0.04266 | 0.04266 |
| (uniform_224x224, Fast_cuda_grouping_disabled) | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 |
| (uniform_224x224, Fast_cuda_grouping_enabled) | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 | 2.384e-07 |
| (uniform_512x512, Fast_cuda_grouping_disabled) | 0.09005 | 0.09005 | 0.09005 | 0.09005 | 0.09005 | 0.09005 |
| (uniform_512x512, Fast_cuda_grouping_enabled) | 0.09005 | 0.09005 | 0.09005 | 0.09005 | 0.09005 | 0.09005 |

Summary: Mean Absolute Output Difference vs. Slow processor

This table shows the mean absolute difference between the output tensors of the Fast processors and the Slow processor for each configuration and image scenario.

| Batch Size | 1 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|
| (mixed_various, Fast_cpu_grouping_disabled) | 5.315e-08 | 7.732e-05 | 7.452e-05 | 7.956e-05 | 7.892e-05 | 8e-05 |
| (mixed_various, Fast_cpu_grouping_enabled) | 5.315e-08 | 7.732e-05 | 7.452e-05 | 7.956e-05 | 7.892e-05 | 8e-05 |
| (uniform_1024x1024, Fast_cpu_grouping_disabled) | 9.615e-05 | 9.615e-05 | 9.615e-05 | 9.615e-05 | 9.615e-05 | 9.615e-05 |
| (uniform_1024x1024, Fast_cpu_grouping_enabled) | 9.615e-05 | 9.615e-05 | 9.615e-05 | 9.615e-05 | 9.615e-05 | 9.615e-05 |
| (uniform_224x224, Fast_cpu_grouping_disabled) | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 |
| (uniform_224x224, Fast_cpu_grouping_enabled) | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 |
| (uniform_512x512, Fast_cpu_grouping_disabled) | 2.832e-05 | 2.832e-05 | 2.832e-05 | 2.832e-05 | 2.832e-05 | 2.832e-05 |
| (uniform_512x512, Fast_cpu_grouping_enabled) | 2.832e-05 | 2.832e-05 | 2.832e-05 | 2.832e-05 | 2.832e-05 | 2.832e-05 |
| (mixed_various, Fast_cuda_grouping_disabled) | 5.315e-08 | 0.002611 | 0.002679 | 0.002686 | 0.0027 | 0.002701 |
| (mixed_various, Fast_cuda_grouping_enabled) | 5.315e-08 | 0.002611 | 0.002679 | 0.002686 | 0.0027 | 0.002701 |
| (uniform_1024x1024, Fast_cuda_grouping_disabled) | 0.002783 | 0.002783 | 0.002783 | 0.002783 | 0.002783 | 0.002783 |
| (uniform_1024x1024, Fast_cuda_grouping_enabled) | 0.002783 | 0.002783 | 0.002783 | 0.002783 | 0.002783 | 0.002783 |
| (uniform_224x224, Fast_cuda_grouping_disabled) | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 |
| (uniform_224x224, Fast_cuda_grouping_enabled) | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 | 5.315e-08 |
| (uniform_512x512, Fast_cuda_grouping_disabled) | 0.002913 | 0.002913 | 0.002913 | 0.002913 | 0.002913 | 0.002913 |
| (uniform_512x512, Fast_cuda_grouping_enabled) | 0.002913 | 0.002913 | 0.002913 | 0.002913 | 0.002913 | 0.002913 |
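The timing plots below report wall-clock time per image for each configuration. For reference, a rough sketch of how such per-image timings can be measured (this harness is an assumption, not the actual benchmark script):

```python
import time

def time_per_image(processor, images, n_warmup=2, n_runs=5):
    """Rough average wall-clock seconds per image for one processor configuration."""
    for _ in range(n_warmup):
        processor(images=images, return_tensors="pt")  # warm-up: caches, lazy init
    start = time.perf_counter()
    for _ in range(n_runs):
        processor(images=images, return_tensors="pt")
    return (time.perf_counter() - start) / (n_runs * len(images))
```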

Time per image:

[figure: time_per_image_all_configs]

With different image sizes:

[figures: time_per_image_all_configs, one per image-size scenario]

Speedups:

[figure: speedup_vs_slow]

With different image sizes:

[figures: speedup_vs_slow, one per image-size scenario]

Cc @qubvel @ArthurZucker @Cyrilvallez

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment

Yaaay, great work, happy to see fast processors being the default 🚀


@auto_docstring
def preprocess(
def _preprocess_videos(
Member:

I don't think we should keep maintaining a separate fn to preprocess videos with every new feature. WDYT if we feed videos to self._preprocess_image and set disable_grouping=False? AFAIK that's the only diff for Qwen.

Member (Author):

You're right! Removed it in the code. The only change necessary is to add the temporal dimension for images only, not for videos.
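(Roughly, the shared path treats an image as a single-frame video; a hypothetical illustration of the shape handling, not the actual code:)

```python
import torch

video = torch.rand(8, 3, 224, 224)   # (T, C, H, W): already has a temporal dim
image = torch.rand(3, 224, 224)      # (C, H, W): no temporal dim

# Add a singleton temporal dim so the image can flow through the video path.
image_as_video = image.unsqueeze(0)  # (1, C, H, W)
```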

processed_videos_grouped[shape] = flatten_patches
processed_grids[shape] = [[grid_t, grid_h, grid_w]] * batch_size

def get_number_of_image_patches(self, height: int, width: int, images_kwargs=None):
Member:

This should not be removed! It is used by vLLM to infer the number of patches and placeholders without an image input. Can you add a tiny comment in the docstring as well, so we don't accidentally delete it again?

Member (Author):

Sorry, I must have been too eager to delete the old code 😅, will add a comment!

@qubvel (Contributor) commented Jul 23, 2025:

@zucchini-nlp, hah, you said no one would touch this code 😆 Do we plan to have some vLLM integration tests to check that the required methods/attributes still exist?

Member:

haha my bad, will add some tests for new helpers 😄
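(For context, a rough sketch of how a consumer such as vLLM might call this helper to size placeholders without a real image; the checkpoint name and image dimensions here are just examples:)

```python
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Infer how many patches a 1024x768 image would yield, without loading any image.
num_patches = image_processor.get_number_of_image_patches(height=1024, width=768)
print(num_patches)
```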

@qubvel (Contributor) left a comment

Thanks for working on this 🤗 huge speed up!!

logger = logging.get_logger(__name__)


FORCE_FAST_IMAGE_PROCESSOR = ["Qwen2VLImageProcessor"]
Contributor:

👍

Collaborator:

we should probably give the option to opt out -> make it a from_pretrained arg?

Contributor:

@ArthurZucker use_fast=False would still work; this changes only the default behaviour, when use_fast is not provided to from_pretrained.

Member (Author):

Yes exactly as @qubvel said :)
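(In other words, the promotion to fast applies only when use_fast is not passed at all; a simplified sketch of the dispatch logic, not the actual implementation:)

```python
FORCE_FAST_IMAGE_PROCESSOR = ["Qwen2VLImageProcessor"]

def resolve_image_processor_class(saved_class_name, use_fast=None):
    # use_fast=None means the caller made no explicit choice: promote listed
    # slow processors to their fast variant (this is where the warning fires).
    if use_fast is None:
        if saved_class_name in FORCE_FAST_IMAGE_PROCESSOR:
            return saved_class_name + "Fast"
        return saved_class_name
    # An explicit use_fast=True/False always wins.
    return saved_class_name + "Fast" if use_fast else saved_class_name
```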

@yonigozlan (Member, Author) commented

Thanks for the review @qubvel @zucchini-nlp, made the modifications! I also needed to change quite a few processor tests for the Qwen VL/Omni models because of the switch to fast image processors by default, but it should be good now!

@zucchini-nlp (Member) left a comment

Thanks for iterating!

@github-actions (Contributor) commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, colqwen2, glm4v, qwen2_5_omni, qwen2_5_vl, qwen2_vl

@ArthurZucker (Collaborator) left a comment

Thanks 🤗


@yonigozlan yonigozlan merged commit 17f0210 into huggingface:main Jul 25, 2025
25 checks passed
@yhyang201 commented Aug 13, 2025

Hi
I noticed that this PR appears to change the behavior of the Qwen2-VL processor (I haven’t tested Qwen2.5-VL yet, but I suspect it shows a similar pattern).
Using the Qwen2.5-VL demo image (https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg) as an example:
- Before the PR, the processor produced pixel_values with shape [4988, 1176];
- After the PR, the shape becomes [14308, 1176].

In both cases, the model can correctly interpret the image.

Could you share whether this change is intentional and necessary? If model capability remains the same, a smaller pixel_values (i.e., a shorter sequence length) should lead to faster inference.

@yonigozlan (Member, Author) commented

Hi @yhyang201 , I'm not able to reproduce the issue, could you please provide a snippet to do so? Thanks.

@yhyang201 commented

Hi @yonigozlan , I sincerely apologize — I have double-checked and confirmed that this PR does not exhibit the issue I mentioned earlier.
Thank you very much for your contribution, and I’m truly sorry for the inconvenience and for taking up your time.

@yonigozlan (Member, Author) commented

@yhyang201 no worries 🤗
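(For reference, a minimal sketch of how to check pixel_values shapes on the demo image from the exchange above; the Qwen2-VL checkpoint name is an assumption:)

```python
import requests
from PIL import Image
from transformers import AutoProcessor

url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)
```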

zaristei pushed a commit to zaristei/transformers that referenced this pull request Sep 9, 2025
…L + Refactor (huggingface#39591)

* init

* Force qwen2VL image proc to fast

* refactor qwen2 vl fast

* fix copies

* Update after PR review and update tests to use return_tensors="pt"

* fix processor tests

* add BC for min pixels/max pixels