[Ernie 4.5] Ernie VL models
#39585
Conversation
…original formula (torch.allclose always True) leading to slightly different generations
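For context on why torch.allclose can be True while generations still drift: its default tolerances absorb small relative differences, and those can compound across layers and decoding steps. A generic illustration:

```python
import torch

a = torch.randn(4, 8)
b = a * (1 + 1e-6)  # tiny relative perturbation

# True under the default tolerances (rtol=1e-5, atol=1e-8), yet deltas of this
# size can accumulate over many layers/decoding steps into different tokens.
print(torch.allclose(a, b))  # True
```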
run-slow: ernie4_5_vl_moe
This comment contains models: ["models/ernie4_5_vl_moe"]
CI Results: ✅ No failing test specific to this PR 🎉!
… specifically load via pretrained (overridable from pretrained for auto classes)
run-slow: ernie4_5_vl_moe
This comment contains models: ["models/ernie4_5_vl_moe"]
CI Results: ✅ No failing test specific to this PR 🎉!
molbap left a comment
Read, re-read, re-re-read, I think this series is converging hehe. I left a few small comments, apologies if any of them have already been addressed above.
Massive work!
The processor is complicated but I find it readable in the end.
Yea it's quite hacky, especially since video and image are treated with different final dimensions. The mm_token_type_ids will have to be used at some point in all VLMs; it makes them compatible with vLLM iirc.
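For context, `mm_token_type_ids` boils down to a per-position mask separating text tokens from multimodal placeholder tokens; a minimal sketch, where the placeholder id is made up for illustration:

```python
import torch

IMAGE_PLACEHOLDER_ID = 5  # hypothetical image placeholder token id

input_ids = torch.tensor([[101, 5, 5, 5, 42, 7]])
mm_token_type_ids = (input_ids == IMAGE_PLACEHOLDER_ID).long()
print(mm_token_type_ids)  # tensor([[0, 1, 1, 1, 0, 0]])
```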
src/transformers/models/ernie4_5_vl_moe/modular_ernie4_5_vl_moe.py
```python
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])

height, width = get_image_size(images[0], channel_dim=input_data_format)
```
we also assume that all images have the same base size, or should get_image_size be in the loop?
it's the same as in qwen vl and glm4v, apparently it does assume the same size
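For reference, the per-image variant the question suggests would look roughly like this (a sketch, not what the PR does; `images` and `input_data_format` as in the snippet above):

```python
from transformers.image_utils import get_image_size

for image in images:
    # Look the size up per image instead of once outside the loop.
    height, width = get_image_size(image, channel_dim=input_data_format)
    # ... resize/patchify using this image's own height and width
```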
Ok, so I deep dived a bit: at first glance it does look like there is more than one image at a time, but if you look closer we already process one image at a time (the for loop could be ignored):

- Loop through the images and call `_preprocess` (transformers/src/transformers/models/ernie4_5_vl_moe/image_processing_ernie4_5_vl_moe.py, lines 403 to 422 in 1fd297d):

```python
for image in images:
    patches, image_grid_thw = self._preprocess(
        image,
        do_resize=do_resize,
        size=size,
        resample=resample,
        do_rescale=do_rescale,
        rescale_factor=rescale_factor,
        do_normalize=do_normalize,
        image_mean=image_mean,
        image_std=image_std,
        patch_size=patch_size,
        temporal_patch_size=temporal_patch_size,
        merge_size=merge_size,
        data_format=data_format,
        do_convert_rgb=do_convert_rgb,
        input_data_format=input_data_format,
    )
    pixel_values.extend(patches)
    vision_grid_thws.append(image_grid_thw)
```

- `_preprocess` loops as well, but over a list of one image (lines 228 and 248 in 1fd297d):

```python
images = make_list_of_images(images)
...
for image in images:
```

- The fast processor correctly groups instead and processes same-sized images at once (roughly the pattern sketched below).
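A rough sketch of that grouping pattern, assuming the `group_images_by_shape` / `reorder_images` helpers from transformers' fast image-processing utilities (signatures may differ across versions; `process_group` is hypothetical and `images` is as in the snippets above):

```python
import torch
from transformers.image_processing_utils_fast import group_images_by_shape, reorder_images

def process_group(stacked_images: torch.Tensor) -> torch.Tensor:
    # Hypothetical per-group op (resize + patchify, etc.); same-shaped images
    # arrive stacked into a single batched tensor.
    return stacked_images

# Bucket same-shaped images together so each bucket is processed in one shot.
grouped_images, grouped_images_index = group_images_by_shape(images, disable_grouping=False)
processed_groups = {shape: process_group(stacked) for shape, stacked in grouped_images.items()}

# Restore the original input ordering.
processed_images = reorder_images(processed_groups, grouped_images_index)
```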
okay, thanks for the deep dive! I think the _preprocess interface is a bit of a weak link here tbh, things were built around it instead of inside it, in several places... not a very re-usable API as it stands (it was fine when everything was PIL-based)
src/transformers/models/ernie4_5_vl_moe/modular_ernie4_5_vl_moe.py
[For maintainers] Suggested jobs to run (before merge):
run-slow: auto, ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, qwen2_5_vl, qwen2_vl, qwen3_omni_moe, qwen3_vl
run-slow: ernie4_5_vl_moe
This comment contains models: ["models/ernie4_5_vl_moe"]
View the CircleCI Test Summary for this PR:
CI Results: Model CI Report: ❌ Failed tests
Continuation of #39228 for the VL models
Current inference script for testing (torch 2.9.1):
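A minimal sketch of this kind of script, where the checkpoint name, image URL, and Auto classes are assumptions rather than details taken from the PR:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed checkpoint name; the PR does not pin one here.
model_id = "baidu/ERNIE-4.5-VL-28B-A3B-PT"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/mountains.jpg"},  # placeholder URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```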
Output:
The image features a scenic view of rolling mountains under a clear sky. In the foreground, there is a person sitting on a hillside, facing away from the camera. The person is wrapped in a colorful, striped blanket. The landscape is dotted with vibrant pink flowers, adding a pop of color to the

Left TODOs: