Closed
Description
Hello, and thank you very much for your work. Why do you stitch the candidate images into a single problem image (e.g., for the visual jigsaw) instead of inputting the question and the multiple candidate images separately? How does this choice relate to training?
Additionally, can the order of image and text in the response be swapped during training? As I understand it, the thinking process in Bagel was originally designed for editing tasks, so the thinking text comes before the image. For understanding tasks, though, can this order be swapped before the answer is given?