Conversation

@vasqu
Contributor

@vasqu vasqu commented Jul 22, 2025

Continuation of #39228 for the VL models

Current inference script for testing (torch 2.9.1):

import requests
from PIL import Image

from transformers import AutoConfig, AutoModelForImageTextToText, AutoProcessor


# Processor and config come from the locally converted checkpoint;
# the weights are loaded from the original hub repo below.
model_path = "/raid/anton/code/forks/transformers/AntonV/ErnieVL"
processor = AutoProcessor.from_pretrained(model_path)

config = AutoConfig.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    "baidu/ERNIE-4.5-VL-28B-A3B-PT",
    config=config,
    device_map="auto",
    dtype="auto",
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Only use English during your responses and describe the following image."},
            {"type": "image"},
        ]
    },
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open(requests.get("https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg", stream=True).raw)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
)
# Decode only the newly generated tokens (skip the prompt).
print(processor.decode(generated_ids[0][len(inputs["input_ids"][0]):]))

Output:
The image features a scenic view of rolling mountains under a clear sky. In the foreground, there is a person sitting on a hillside, facing away from the camera. The person is wrapped in a colorful, striped blanket. The landscape is dotted with vibrant pink flowers, adding a pop of color to the

Left TODOs:

  • TP
    • EP to add later after it's been fixed

@vasqu
Contributor Author

vasqu commented Dec 16, 2025

run-slow: ernie4_5_vl_moe

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/ernie4_5_vl_moe"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

… specifically load via pretrained (overridable from pretrained for auto classes)
@vasqu
Contributor Author

vasqu commented Dec 16, 2025

run-slow: ernie4_5_vl_moe

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/ernie4_5_vl_moe"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

@vasqu
Contributor Author

vasqu commented Dec 17, 2025

run-slow: ernie4_5_vl_moe

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/ernie4_5_vl_moe"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR 🎉 !

Contributor

@molbap molbap left a comment

Read, re-read, re-re-read, I think this series is converging hehe. I left a few small comments, apologies if it's already addressed above.
Massive work!

Contributor

The processor is complicated but I find it readable in the end.

Contributor Author

Yea, it's quite hacky, especially since video and image are treated with different final dimensions. The mm_token_type_ids will have to be used at some point in all VLMs; it makes them compatible with vLLM iirc.
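
As an illustration of the idea (the token ids here are made up, not the real Ernie vocabulary): mm_token_type_ids simply flag which positions of input_ids hold multimodal placeholder tokens.

```python
# Hypothetical example: 5 stands in for an image placeholder token id.
input_ids = [101, 5, 5, 5, 42, 7]

# 1 marks an image position, 0 marks ordinary text.
mm_token_type_ids = [1 if tok == 5 else 0 for tok in input_ids]
print(mm_token_type_ids)  # [0, 1, 1, 1, 0, 0]
```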

# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])

height, width = get_image_size(images[0], channel_dim=input_data_format)
Contributor

We also assume that all images have the same base size, or should get_image_size be in the loop?

Contributor Author

It's the same as in Qwen VL and GLM4V; apparently it does assume the same size.
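
If mixed sizes ever needed handling, moving the lookup into the loop would look roughly like this (a minimal sketch; get_image_size below is a simplified stand-in, not the actual transformers helper):

```python
import numpy as np

def get_image_size(image, channel_dim="channels_last"):
    # Simplified stand-in: return (height, width) for one image array.
    if channel_dim == "channels_last":       # (H, W, C)
        return image.shape[0], image.shape[1]
    return image.shape[1], image.shape[2]    # channels_first: (C, H, W)

images = [np.zeros((32, 48, 3)), np.zeros((64, 48, 3))]

# Per-image variant: the size lookup sits inside the loop, so
# differently sized inputs are handled individually.
sizes = [get_image_size(img) for img in images]
print(sizes)  # [(32, 48), (64, 48)]
```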

Contributor Author

Ok, so I did a deep dive, and at first glance it does look like there is more than one image at a time, but if you look closer we already process one image at a time (the for loop could be ignored):

  1. Loop through the images and call _preprocess:

     for image in images:
         patches, image_grid_thw = self._preprocess(
             image,
             do_resize=do_resize,
             size=size,
             resample=resample,
             do_rescale=do_rescale,
             rescale_factor=rescale_factor,
             do_normalize=do_normalize,
             image_mean=image_mean,
             image_std=image_std,
             patch_size=patch_size,
             temporal_patch_size=temporal_patch_size,
             merge_size=merge_size,
             data_format=data_format,
             do_convert_rgb=do_convert_rgb,
             input_data_format=input_data_format,
         )
         pixel_values.extend(patches)
         vision_grid_thws.append(image_grid_thw)

  2. _preprocess loops as well (but over a list containing a single image)
  3. The fast processor correctly groups instead and processes images of the same size at once
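
The grouping the fast processor does can be sketched roughly like this (group_images_by_shape here is a simplified stand-in for illustration, not the actual transformers helper):

```python
from collections import defaultdict

import numpy as np

def group_images_by_shape(images):
    # Bucket image indices by (H, W) so each bucket can be processed
    # as one batch, instead of looping image by image.
    groups = defaultdict(list)
    for idx, img in enumerate(images):
        groups[img.shape[:2]].append(idx)
    return groups

images = [np.zeros((32, 32, 3)), np.zeros((64, 64, 3)), np.zeros((32, 32, 3))]
groups = group_images_by_shape(images)
print(dict(groups))  # {(32, 32): [0, 2], (64, 64): [1]}
```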

Contributor

okay, thanks for the deep dive! I think the _preprocess interface is a bit of a weak link here tbh, things were built around it instead of inside it, in several places... not a very re-usable API as it stands (it was fine when everything was PIL-based)

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, qwen2_5_vl, qwen2_vl, qwen3_omni_moe, qwen3_vl

@vasqu
Contributor Author

vasqu commented Dec 19, 2025

run-slow: ernie4_5_vl_moe

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/ernie4_5_vl_moe"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • ernie4_5_vl_moe:
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeModelTest::test_init_weights_can_init_buffers
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeSmallIntegrationTest::test_small_model_integration_test
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeSmallIntegrationTest::test_small_model_integration_test_batch
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeSmallIntegrationTest::test_small_model_integration_test_batch_different_resolutions
    tests/models/ernie4_5_vl_moe/test_modeling_ernie4_5_vl_moe.py::Ernie4_5_VL_MoeSmallIntegrationTest::test_small_model_integration_test_batch_wo_image
