Multimodal Intelligence:
Next Token Prediction & Beyond
ICLR 2026 Workshop
Location: Rio de Janeiro, Brazil
Date: 26 or 27 April 2026

News

  • The call for papers is now open! Submit your paper here.

About

Multimodal foundation models combine vision, language, audio, video, and other modalities, enabling them to understand and interact with the world in richer, more human-like ways. These models are rapidly evolving beyond classic next-token prediction and increasingly operate under broader next-X prediction principles: forecasting the next token, frame, or scale across discrete and continuous spaces. Models like Chameleon extend token prediction across modalities, while continuous approaches such as VAR, MAR, TransFusion, BAGEL, and Fluid model dynamics directly in latent spaces. In parallel, predictive encoders such as V-JEPA 2 learn by anticipating future or missing representations, capturing core structure (e.g., trajectories and interactions) without generating each token. This yields strong performance in understanding, planning, and video QA while avoiding the overhead of pixel- or token-level generation. A third paradigm, discrete diffusion models, treats generation as iterative denoising instead of left-to-right prediction. Recent systems like LLaDA, Dream, Dimple, LLaDA-V, and LaViDa show how discrete diffusion enables parallel generation and flexible multimodal conditioning.
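To make the contrast concrete, below is a minimal, illustrative sketch of how these three families of training objectives differ, written as per-example PyTorch losses. This is our own simplification, not the exact loss of any system named above; all function names and tensor shapes are assumptions for illustration.

```python
import torch.nn.functional as F

def next_token_loss(logits, targets):
    """Autoregressive next-token prediction: cross-entropy at every position.

    logits: (seq_len, vocab_size) predictions; targets: (seq_len,) token ids.
    """
    return F.cross_entropy(logits, targets)

def latent_prediction_loss(predicted, target):
    """Predictive-encoder (JEPA-style) objective: regress future or masked
    representations in latent space, with no pixel- or token-level decoding.

    predicted, target: (num_latents, dim) continuous embeddings.
    """
    return F.mse_loss(predicted, target)

def masked_diffusion_loss(logits, targets, mask):
    """Discrete (masked) diffusion: denoise only the currently masked
    positions, which permits parallel, order-free generation.

    mask: (seq_len,) boolean tensor marking positions to reconstruct.
    """
    return F.cross_entropy(logits[mask], targets[mask])
```

In this simplified view, the paradigms differ mainly in what is predicted (discrete tokens vs. continuous latents) and in which positions incur loss (the next position vs. an arbitrary masked subset).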
This workshop brings together these complementary perspectives—autoregressive modeling, predictive encoders, diffusion-based generators, and other emerging hybrids—to discuss central questions for the next generation of multimodal foundation models:
  • Which generation paradigm yields better representations for downstream multimodal tasks (e.g., visual understanding, image generation, video QA)?
  • How do these paradigms compare in terms of scaling behavior and data efficiency?
  • Can different generation paradigms be combined, and what can they learn from each other?
By bringing together researchers working on these diverse yet related foundations, this workshop aims to chart a unifying perspective on the next generation of multimodal foundation models—beyond token prediction alone, toward models that truly predict, perceive, and reason about the world.

Topics of Interest

Objective Comparisons
Head-to-head studies comparing next-token prediction vs. predictive encoding vs. diffusion under matched data/compute, with clear win conditions and ablations of design choices.
Hybrid Training Recipes
Joint or two-stage objectives (e.g., AR + latent forecasting; diffusion as a refiner), cross-paradigm distillation, and criteria for when hybrids help.
Tokenization & Representation Interfaces
Discrete tokens vs. continuous latents, visual/audio tokenizers, quantizers, and their impact on grounding, fidelity, and controllability.
Scaling & Data Efficiency
Scaling behavior across objectives, data mixtures (images/video/audio/text), synthetic/weak supervision, and long-video or streaming data curricula.
Evaluation Protocols
Standardized suites for multimodal reasoning, temporal/spatial grounding, planning/control, editing, and controllability.
Applications & Embodied Agents
Robotics and long-horizon control, video QA, document understanding; linking objective choice to downstream performance.

Call for Papers

We invite submissions on multimodal foundation models, next-token prediction paradigms, and efficient or unified modeling frameworks, with particular interest in the workshop topics listed above. We will accept both main-track and tiny papers. Submissions are handled through OpenReview.

Submission Tracks

Main Track: The main track welcomes submissions of up to 8 pages, excluding references and supplementary materials. We invite high-quality research that presents original, unpublished work or work currently under submission elsewhere. We will also consider recently published papers. Submissions in this track are expected to make technical or conceptual contributions related to multimodal foundation models.

Tiny Papers Track: We will feature a Tiny Papers Track for shorter contributions of up to 4 pages. This track is intended for works-in-progress and exploratory studies, providing a space for authors to receive feedback and refine early ideas. We encourage students, early-career researchers, and participants from underrepresented or under-resourced backgrounds to share preliminary findings, exchange ideas, and build collaborations within the community. Authors of these papers will be considered for potential funding from ICLR, but must submit a separate application for Financial Assistance, which evaluates their eligibility. The application for Financial Assistance to attend ICLR 2026 will become available here at the beginning of February and close in early March.

Key Dates

  • Paper Deadline: 5 February 2026 11:59 PM (AoE)
  • Acceptance Notification: 1 March 2026 11:59 PM (AoE)
  • Camera-ready: 8 March 2026 11:59 PM (AoE)
  • Workshop Date: 26 or 27 April 2026

Submission Guidelines

Format: All submissions must be a single PDF file. Submissions should follow the ICLR 2026 style guidelines, available here. References and supplementary materials are not included in the page limit, but the main text must be self-contained.

Dual-submission and non-archival policy: We welcome ongoing and unpublished work. We will also accept papers that are under review at the time of submission, or that have been recently accepted, provided they do not breach any dual-submission or anonymity policies of those venues. The workshop is a non-archival venue and will not have official proceedings.

Double-blind reviewing: All submissions must be anonymized and must not contain any identifying information that violates the double-blind reviewing policy. This policy also applies to any supplementary or linked material, including code.

Camera-Ready Submissions: Authors of accepted papers will be invited to submit a final version incorporating reviewer feedback. Accepted papers will be made publicly available on OpenReview before the workshop.

Contact: For any questions, please contact the organizers at [email protected].

🏆 Best Paper Award: We will present a Best Paper Award to recognize outstanding contributions to the workshop.


Schedule

The detailed schedule will be announced soon.

Invited Speakers



Juan Carlos Niebles

Salesforce / Stanford University

Mike Z. Shou

National University of Singapore

Chelsea Finn (TBC)

Stanford University

Hanna Hajishirzi (TBC)

Allen Institute for AI / University of Washington

All speakers will be confirmed soon.

Workshop Organizers



Ivona Najdenkoska

University of Amsterdam

Mohammad Mahdi Derakhshani

University of Amsterdam

Marzieh Fadaee

Cohere Labs

Kai Han

University of Hong Kong

Saining Xie

New York University / Google DeepMind

Yuki M. Asano

University of Technology Nuremberg

Cees G. M. Snoek

University of Amsterdam
