[Ernie 4.5] Ernie VL models
#39585
Conversation
…original formula (torch.allclose always True) leading to slightly different generations
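For context on why torch.allclose can be True while generations still drift: its default tolerances absorb small relative differences, and those can compound across layers and decoding steps. A generic illustration:

```python
import torch

a = torch.randn(4, 8)
b = a * (1 + 1e-6)  # tiny relative perturbation

# True under the default tolerances (rtol=1e-5, atol=1e-8), yet deltas of this
# size can accumulate over many layers/decoding steps into different tokens.
print(torch.allclose(a, b))  # True
```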
run-slow: ernie4_5_vl_moe
This comment contains models: ["models/ernie4_5_vl_moe"]
CI Results: ✅ No failing test specific to this PR 🎉!
… specifically load via pretrained (overridable from pretrained for auto classes)
run-slow: ernie4_5_vl_moe
This comment contains models: ["models/ernie4_5_vl_moe"]
CI Results: ✅ No failing test specific to this PR 🎉!
molbap left a comment
Read, re-read, re-re-read, I think this series is converging hehe. I left a few small comments, apologies if any of them have already been addressed above.
Massive work!
The processor is complicated but I find it readable in the end.
Yea it's quite hacky, especially since video and image are treated with different final dimensions. The mm_token_type_ids will have to be used at some point in all VLMs; it makes them compatible with vLLM iirc.
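For context, `mm_token_type_ids` boils down to a per-position mask separating text tokens from multimodal placeholder tokens; a minimal sketch, where the placeholder id is made up for illustration:

```python
import torch

IMAGE_PLACEHOLDER_ID = 5  # hypothetical image placeholder token id

input_ids = torch.tensor([[101, 5, 5, 5, 42, 7]])
mm_token_type_ids = (input_ids == IMAGE_PLACEHOLDER_ID).long()
print(mm_token_type_ids)  # tensor([[0, 1, 1, 1, 0, 0]])
```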
src/transformers/models/ernie4_5_vl_moe/modular_ernie4_5_vl_moe.py
```python
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])

height, width = get_image_size(images[0], channel_dim=input_data_format)
```
we also assume that all images have the same base size, or should get_image_size be in the loop?
it's the same as in qwen vl and glm4v, apparently it does assume the same size
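For reference, the per-image variant the question suggests would look roughly like this (a sketch, not what the PR does; `images` and `input_data_format` as in the snippet above):

```python
from transformers.image_utils import get_image_size

for image in images:
    # Look the size up per image instead of once outside the loop.
    height, width = get_image_size(image, channel_dim=input_data_format)
    # ... resize/patchify using this image's own height and width
```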
Ok, so I deep dived a bit: at first glance it does look like there is more than one image at a time, but if you look closer we already process one image at a time (the for loop could be ignored):

- Loop through the images and call `_preprocess` (transformers/src/transformers/models/ernie4_5_vl_moe/image_processing_ernie4_5_vl_moe.py, lines 403 to 422 in 1fd297d):

```python
for image in images:
    patches, image_grid_thw = self._preprocess(
        image,
        do_resize=do_resize,
        size=size,
        resample=resample,
        do_rescale=do_rescale,
        rescale_factor=rescale_factor,
        do_normalize=do_normalize,
        image_mean=image_mean,
        image_std=image_std,
        patch_size=patch_size,
        temporal_patch_size=temporal_patch_size,
        merge_size=merge_size,
        data_format=data_format,
        do_convert_rgb=do_convert_rgb,
        input_data_format=input_data_format,
    )
    pixel_values.extend(patches)
    vision_grid_thws.append(image_grid_thw)
```

- `_preprocess` loops as well, but over a list of one image (lines 228 and 248 in 1fd297d):

```python
images = make_list_of_images(images)
...
for image in images:
```

- The fast processor correctly groups instead and processes same-sized images at once (roughly the pattern sketched below).
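A rough sketch of that grouping pattern, assuming the `group_images_by_shape` / `reorder_images` helpers from transformers' fast image-processing utilities (signatures may differ across versions; `process_group` is hypothetical and `images` is as in the snippets above):

```python
import torch
from transformers.image_processing_utils_fast import group_images_by_shape, reorder_images

def process_group(stacked_images: torch.Tensor) -> torch.Tensor:
    # Hypothetical per-group op (resize + patchify, etc.); same-shaped images
    # arrive stacked into a single batched tensor.
    return stacked_images

# Bucket same-shaped images together so each bucket is processed in one shot.
grouped_images, grouped_images_index = group_images_by_shape(images, disable_grouping=False)
processed_groups = {shape: process_group(stacked) for shape, stacked in grouped_images.items()}

# Restore the original input ordering.
processed_images = reorder_images(processed_groups, grouped_images_index)
```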
okay, thanks for the deep dive! I think the _preprocess interface is a bit of a weak link here tbh, things were built around it instead of inside it, in several places... not a very re-usable API as it stands (it was fine when everything was PIL-based)
src/transformers/models/ernie4_5_vl_moe/modular_ernie4_5_vl_moe.py
[For maintainers] Suggested jobs to run (before merge):
run-slow: auto, ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, qwen2_5_vl, qwen2_vl, qwen3_omni_moe, qwen3_vl
run-slow: ernie4_5_vl_moe
This comment contains models: ["models/ernie4_5_vl_moe"]
View the CircleCI Test Summary for this PR:
CI Results: Model CI Report: ❌ Failed tests
Continuation of #39228 for the VL models
Current inference script for testing (torch 2.9.1):
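A minimal sketch of this kind of script, where the checkpoint name, image URL, and Auto classes are assumptions rather than details taken from the PR:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed checkpoint name; the PR does not pin one here.
model_id = "baidu/ERNIE-4.5-VL-28B-A3B-PT"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/mountains.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```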
Output:
The image features a scenic view of rolling mountains under a clear sky. In the foreground, there is a person sitting on a hillside, facing away from the camera. The person is wrapped in a colorful, striped blanket. The landscape is dotted with vibrant pink flowers, adding a pop of color to the

Left TODOs: