🚨[Fast Image Processor] Force Fast Image Processor for Qwen2_VL/2_5_VL + Refactor #39591
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
zucchini-nlp
left a comment
Yaaay, great work, happy to see fast processors being the default 🚀
    @auto_docstring
    def preprocess(
    def _preprocess_videos(
I don't think we should keep maintaining a separate fn to preprocess videos with every new feature. WDYT if we feed videos to self._preprocess_image and set disable_grouping=False? AFAIK that's the only diff for Qwen
You're right! Removed it in the code. The only change necessary is to add the temporal dimension for images only, not for videos.
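A minimal sketch of the idea settled on in this thread; the names (`_preprocess_image`, `disable_grouping`) follow the discussion above and are assumptions, not necessarily the merged code:

```python
# Sketch only: route a video's frames through the image path instead of
# maintaining a separate full _preprocess_videos implementation.
def _preprocess_videos(self, videos, **kwargs):
    outputs = []
    for frames in videos:
        # Keep shape-grouping enabled (disable_grouping=False), as suggested above.
        outputs.append(self._preprocess_image(frames, disable_grouping=False, **kwargs))
    return outputs
```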
    processed_videos_grouped[shape] = flatten_patches
    processed_grids[shape] = [[grid_t, grid_h, grid_w]] * batch_size

    def get_number_of_image_patches(self, height: int, width: int, images_kwargs=None):
This should not be removed! It is used by vLLM to infer the number of patches and placeholders without an image input. Can you add a tiny comment in the docstring as well, so we don't delete it again accidentally?
Sorry, I must have been too eager to delete the old code 😅, will add a comment!
@zucchini-nlp, hah, you said no one would touch this code 😆 do we plan to have some vLLM integration tests to check that the required methods/attributes still exist?
haha my bad, will add some tests for new helpers 😄
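A sketch of the kind of docstring note requested above; the wording is illustrative, not the exact text added in the PR:

```python
def get_number_of_image_patches(self, height: int, width: int, images_kwargs=None):
    """
    Return the number of patches for an image of size (height, width)
    without processing any pixels.

    NOTE: used by external inference engines such as vLLM to infer the
    number of patches and placeholder tokens without an image input.
    Do not remove.
    """
    ...
```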
qubvel
left a comment
Thanks for working on this 🤗 huge speed up!!
    logger = logging.get_logger(__name__)

    FORCE_FAST_IMAGE_PROCESSOR = ["Qwen2VLImageProcessor"]
👍
we should probably give the option to opt out -> make it a from_pretrained arg?
@ArthurZucker use_fast=False would still work; the forcing only applies to the default behaviour, when use_fast is not provided to from_pretrained
Yes exactly as @qubvel said :)
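A rough sketch of the behaviour agreed on here (illustrative only, not the exact code in the PR): an explicit `use_fast` always wins, and forcing only kicks in when it is left unset.

```python
FORCE_FAST_IMAGE_PROCESSOR = ["Qwen2VLImageProcessor"]

def resolve_image_processor_class(saved_class: str, use_fast=None) -> str:
    # Explicit user choice always wins.
    if use_fast is not None:
        return saved_class + "Fast" if use_fast else saved_class
    # Default (use_fast not provided): force fast for listed classes.
    if saved_class in FORCE_FAST_IMAGE_PROCESSOR:
        return saved_class + "Fast"
    return saved_class
```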
Thanks for the review @qubvel @zucchini-nlp, made the modifications! I also needed to change quite a few processor tests for the Qwen VLs/Omni because of the switch to the fast image processor by default, but it should be good now!
zucchini-nlp
left a comment
Thanks for iterating!
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, colqwen2, glm4v, qwen2_5_omni, qwen2_5_vl, qwen2_vl
ArthurZucker
left a comment
Thanks 🤗
Hi @yhyang201, I'm not able to reproduce the issue, could you please provide a snippet to do so? Thanks.

Hi @yonigozlan, I sincerely apologize: I have double-checked and confirmed that this PR does not exhibit the issue I mentioned earlier.

@yhyang201 no worries 🤗
…L + Refactor (huggingface#39591)
* init
* Force qwen2VL image proc to fast
* refactor qwen2 vl fast
* fix copies
* Update after PR review and update tests to use return_tensors="pt"
* fix processor tests
* add BC for min pixels/max pixels
What does this PR do?
As discussed internally, this PR starts the process to make fast image processors the default in 🤗Transformers!
When instantiating a Processor or an Image Processor via `AutoProcessor.from_pretrained` or `AutoImageProcessor.from_pretrained` with a checkpoint using a `Qwen2VLImageProcessor`, the behavior will now be to load `Qwen2VLImageProcessorFast`, even if the processor was originally saved with a slow `Qwen2VLImageProcessor`. For instance:
Old behavior:
New behavior:
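The original snippets did not survive extraction; here is a minimal sketch of the two behaviours, assuming a checkpoint saved with the slow `Qwen2VLImageProcessor`:

```python
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Old behavior: the class saved with the checkpoint is returned.
#   type(processor).__name__ == "Qwen2VLImageProcessor"

# New behavior: the fast class is loaded instead, with a one-time warning.
#   type(processor).__name__ == "Qwen2VLImageProcessorFast"
```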
(The warning is a `warning_once`.)

This PR also comes with a long overdue refactor, which should be 100% compatible with the slow image processor of Qwen2 VL and fixes some existing inconsistencies with the fast one. Cc @zucchini-nlp for that :)
🚨 The processed outputs of the slow and fast image processors are slightly different! This is expected, as torchvision and PIL image processing functions are not fully equivalent.
Users can still force the use of a slow processor by loading it with `use_fast=False`.
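For example (checkpoint name illustrative):

```python
from transformers import AutoImageProcessor

# Opt out of the forced fast processor; the slow class is loaded as before.
processor = AutoImageProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", use_fast=False
)
```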
Here are some comparisons between the fast and slow processors with this refactor.
"Mixed various" means images of different sizes are included in the input. The image used for these benchmarks is this one.
Summary of the summary: up to 30x speedup, with average output pixel differences between 5e-8 and 3e-3 depending on the processing parameters and input image sizes.
Summary: Max Output Difference vs. Slow processor
This table shows the maximum difference at any single point between the output tensors of the Fast processors and the Slow processor.
Summary: Mean Absolute Output Difference vs. Slow processor
This table shows the mean absolute difference between the output tensors of the Fast processors and the Slow processor for each configuration and image scenario.
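The benchmark tables themselves did not survive extraction. A minimal sketch of how such differences can be measured, where the checkpoint name and image path are placeholders:

```python
from PIL import Image
from transformers import Qwen2VLImageProcessor, Qwen2VLImageProcessorFast

ckpt = "Qwen/Qwen2-VL-7B-Instruct"         # illustrative checkpoint
image = Image.open("benchmark_image.png")  # placeholder for the benchmark image

slow = Qwen2VLImageProcessor.from_pretrained(ckpt)
fast = Qwen2VLImageProcessorFast.from_pretrained(ckpt)

out_slow = slow(images=image, return_tensors="pt")["pixel_values"]
out_fast = fast(images=image, return_tensors="pt")["pixel_values"]

diff = (out_fast - out_slow).abs()
print("max diff: ", diff.max().item())   # "Max Output Difference"
print("mean diff:", diff.mean().item())  # "Mean Absolute Output Difference"
```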
Time per image:
With different image sizes:
Speedups:
With different image sizes:
Cc @qubvel @ArthurZucker @Cyrilvallez