Question about training sample design.

Hello, and thank you very much for your work. Why do you **stitch** the problem images into a single image (e.g., for the visual jigsaw) instead of feeding in the question and the multiple candidate images as separate inputs? How does this choice relate to training?
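
To make the question concrete, here is a minimal sketch of the two input layouts I am comparing. The helper names (`stitch_horizontally`, `build_sample`) and the sample dictionary layout are my own hypothetical illustration, not this repository's actual data format. Images are plain nested lists so the snippet runs without any dependency.

```python
from typing import Dict, List

Image = List[List[int]]  # rows of grayscale pixels; a stand-in for a real image


def stitch_horizontally(images: List[Image]) -> Image:
    """Concatenate equally tall images side by side into one image."""
    if not images:
        raise ValueError("need at least one image")
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all images must have the same height")
    # Row r of the result is row r of every image, joined left to right.
    return [sum((img[r] for img in images), []) for r in range(height)]


def build_sample(question: str, candidates: List[Image], stitched: bool) -> Dict:
    """Hypothetical sample layout: one stitched image, or separate images."""
    images = [stitch_horizontally(candidates)] if stitched else list(candidates)
    return {"question": question, "images": images}


a = [[1, 1], [1, 1]]
b = [[2, 2], [2, 2]]
separate = build_sample("Which piece goes top-left?", [a, b], stitched=False)
stitched = build_sample("Which piece goes top-left?", [a, b], stitched=True)
```

In the separate layout the model receives one image entry per candidate. In the stitched layout it receives a single image, and the candidates are distinguished only by their position inside it.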

Additionally, can the order of image and text in the response be swapped during training? As I understand it, the thinking process in Bagel was originally designed for editing tasks, so the thinking text comes before the image. For understanding tasks, though, can this order be swapped before the answer is given?



Question about training sample design. #3
