Question about training sample design. #3

@qirui-chen

Hello, and thank you very much for your work. Why do you stitch the question images into a single image (e.g., for the visual jigsaw) instead of feeding the question and the multiple candidate images as separate inputs? How does this choice relate to training?

Additionally, can the order of the image and the text in the response be swapped during training? As I understand it, the thinking process in Bagel was originally designed for editing tasks, so the thinking text comes before the image. For understanding tasks, can this order be swapped before the answer is given?
