The 5th Workshop on Computer Vision in the Wild
Theme: Building Multimodal AI Agents with Verbal, Spatial and Temporal Intelligence
Date: June 3-4 | Location: Denver Convention Center, Denver CO

Overview

As artificial intelligence continues to evolve, the intersection of vision, language, and action is becoming central to systems that must operate reliably in unconstrained, real-world settings. The 5th Workshop on Computer Vision in the Wild (CVinW) at CVPR 2026 aims to bring together researchers and practitioners advancing multimodal AI agents that can perceive, reason, and act across digital and physical environments, while highlighting the capabilities where today’s models still fall short. Building on the success of our previous workshops (CVPR 2025 CVinW Workshop, CVPR 2024 CVinW Workshop, CVPR 2023 CVinW Workshop, and ECCV 2022 CVinW Workshop), this year’s edition focuses on the intersection of large multimodal models (LMMs) and vision-language-action (VLA) systems, with particular emphasis on fine-grained spatiotemporal reasoning, causal inference, long-horizon planning and memory, and robust tool use. The goal is to move beyond static understanding toward agents that operate in dynamic environments, including interactive digital settings and embodied physical interaction.

Image source: Vision-Language Pre-training: Basics, Recent Advances, and Future Trends and Multimodal Foundation Models: From Specialists to General-Purpose Assistants

Over the past years, we have witnessed remarkable advances in open-vocabulary visual comprehension models and multimodal learning. Five years ago, vision-language or multimodal models were mostly built on top of the BERT architecture. Typically, these models contained fewer than 1B parameters and were trained on a small number of images. Representative works include ViLBERT, UNITER, and VisualBERT. They were mostly used for image-text matching tasks such as visual question answering (VQA) and image captioning. Later, multimodal vision foundation models such as CLIP, ALIGN, and Florence emerged, scaling multimodal training up to billions of images. Although these models are still relatively small, they show strong open-vocabulary and zero-shot recognition capability across a wide range of visual domains. These capabilities have been further transferred to fine-grained core vision tasks such as object detection (e.g., M-DETR, ViLD, GLIP, RegionCLIP, GroundingDINO, and OWL-ViT) and image segmentation (e.g., X-Decoder, SegGPT, SEEM, SAM, and LISA). Most recently, we have entered the era of large multimodal models. Connecting multimodal vision models such as CLIP with large language models has led to systems such as Flamingo, Gemini, and GPT-4V, with many advanced multimodal capabilities. We now have multimodal chatbots such as GPT-4o, LLaVA-OneVision, Qwen-2.5-VL, and Phi-4-Multimodal, which can see, talk, and reason.

Despite these successes, current vision models still lack the ability to fully grasp temporal dynamics, causal reasoning, and embodied interactions, which are key elements for autonomous agents that can see, reason, and act. Some recent works have attempted to address these challenges by building agentic models and VLA models. Our workshop aims to bring together leading researchers, practitioners, and industry experts to discuss these emerging trends, challenges, and solutions in the field.

Highlights

(1) Invited Talks from leading experts in academia and industry on the latest advancements in multimodal AI.

(2) Paper Presentations showcasing cutting-edge research contributions in computer vision in the wild.

(3) Panel Discussions (Tentative) exploring the future of vision-language-action models and their impact on robotics, autonomous systems, and real-world AI applications.

We invite researchers, engineers, and enthusiasts to join us in shaping the future of vision systems that go beyond static image recognition to dynamic, interactive, and real-world AI applications. Stay tuned for more details on speakers, paper submissions, and challenge participation!

For more information, visit our official workshop page and explore our CVinW reading list: 📌 CVinW Readings

Invited Speakers

Manling Li
Northwestern University

Chelsea Finn
Stanford University

Mohit Bansal
UNC Chapel Hill

Kate Saenko
Boston University, Meta MSL

Scott Yih
Meta FAIR

Morning Schedule (June 4, Thursday)

8:15 AM - 8:30 AM MT		Welcome Reuben Tan Bio: TODO
8:30 AM - 9:00 AM MT		Invited Talk Manling Li Title: Embodied Spatial Intelligence: Closing the Perception-Action Loop Abstract: Spatial intelligence in the real world is not passive perception; it is an active loop in which an agent acts to see, and sees in order to act. This talk explores closing this perception-action loop, the piece current embodied agents most lack, from several angles. When the observer is recast as an actor, active exploration uncovers what passive viewing cannot, from occlusion to containment to dynamics, and most failures trace back to not knowing how to move rather than to weak perception (ESI-Bench). Planning then becomes a matter of choosing which viewpoints to explore (Planning with the Views), while acting on an instruction exposes the gap between high-level semantics and precise physical execution (ActionEQA). Embodied priors with online reinforcement learning offer one way to turn a vision-language model into an agent that lives in this loop (Embodied Reasoning Agent). Bio: Manling Li is an Assistant Professor at Northwestern University and an Amazon Scholar. She was a postdoc at Stanford University and obtained her PhD in Computer Science from the University of Illinois Urbana-Champaign in 2023. She works on Reasoning, Planning, and Compositionality, at the intersection of Language, Vision, and Robotics. Her work has been recognized with the ACL 2025 Dissertation Award Honorable Mention, the Outstanding Paper Award at ACL’24, the Best Demo Paper Award at NAACL’21 and ACL’20, MIT Tech Review 35 Innovators Under 35, etc. She led the tutorials, workshops, and challenges on Foundation Models meet Embodied Agents.
9:00 AM - 9:30 AM MT		Invited Talk Scott Yih Title: Say Less, Know More: Latent Representations and Procedural Retrieval for Efficient Reasoning Abstract: LLM agents rely on chain-of-thought reasoning to decompose goals, utilize tools, and self-correct, but the context window remains a critical bottleneck: every reasoning step consumes tokens that could otherwise store task states, tool outputs, or memory. The "overthinking" phenomenon, where models generate repetitive, unnecessarily verbose reasoning traces, exacerbates this issue. This talk presents two complementary approaches to reducing the computational and token cost of reasoning. First, LaTexT introduces a hybrid framework that interleaves latent (continuous vector) tokens with standard text tokens during reasoning. By compressing semantic reasoning into latent space while selectively retaining critical mathematical tokens in text form, and by employing an efficient iterative parallelized rollout for training, LaTexT matches 94% of full-text chain-of-thought performance. Simultaneously, it reduces cumulative attention computation by 49% and improves inference throughput by 13%. Second, Reasoning Memory takes an orthogonal approach: instead of compressing the internal reasoning process, it augments it with external procedural knowledge retrieved from prior problem-solving experience. By decomposing existing reasoning trajectories into 32 million compact subquestion–subroutine pairs, the framework enables models to retrieve relevant strategies directly within their thinking stream. A diversity-first scaling strategy explores multiple retrieved procedures in parallel, consistently outperforming document-level RAG and compute-matched baselines by up to 19.2%. Together, these methods address the context bottleneck from two sides: LaTexT minimizes the token footprint of thinking, while Reasoning Memory maximizes the efficacy of each token by reusing accumulated strategies, pointing toward agents that reason not just longer, but smarter. Bio: Scott Wen-tau Yih is a Research Scientist at Meta Fundamental AI Research (FAIR) and an affiliate professor at the University of Washington. Prior to joining Meta, he was a Principal Research Scientist at the Allen Institute for Artificial Intelligence (2017–2019) and a Senior Researcher at Microsoft Research (2005–2017). He received his PhD from the University of Illinois at Urbana-Champaign in 2005. His research interests span natural language processing, machine learning, and information retrieval, with a recent focus on neural retrieval models and retrieval-augmented generation to enhance factuality in text and multimodal language models. He has published numerous influential papers, including WikiQA, DPR, and RAG, that have garnered significant recognition, and received the Best Paper Award at CoNLL-2011 and the Outstanding Paper Award at ACL-2015. He has also served in various leadership roles in the NLP and ML communities, including as program co-chair for CEAS-2009, CoNLL-2014, and EMNLP-2021, and as senior area chair at major NLP (ACL, NAACL, EMNLP, EACL) and ML (ICLR, NeurIPS) conferences. In 2024, he was selected as an ACL Fellow for “significant contributions to information extraction and question answering, neural retrieval and retrieval-augmented generation.”
9:30 AM - 10:00 AM MT	MindCube winner presentation
10:00 AM - 10:30 AM MT		Invited Talk Kate Saenko Title: SAM3: Open-vocabulary object detection and segmentation with Segment Anything Model 3 Abstract: TODO Bio: Kate is an AI Research Scientist at MSL, Meta and a Full Professor of Computer Science at Boston University (currently on leave), where she leads the Computer Vision and Learning Group. Kate received a PhD in EECS from MIT and did postdoctoral training at UC Berkeley and Harvard. Her research interests are in Artificial Intelligence with a focus on out-of-distribution learning, dataset bias, domain adaptation, vision and language understanding, and other topics in deep learning.
10:30 AM - 11:00 AM MT		Invited Talk Chelsea Finn Title: TODO Abstract: TODO Bio: TODO
11:00 AM - 11:30 AM MT		Invited Talk Mohit Bansal Title: Memory, Action, and Skill Planning for Multimodal Agents Abstract: TODO Bio: Dr. Mohit Bansal is the John R. & Louise S. Parker Distinguished Professor, Director of the MURGe-Lab (UNC-AI Group), and Core AI Lead of the ENGAGE NSF-AI Institute in the Computer Science department at the University of North Carolina (UNC) Chapel Hill. He received his Ph.D. from the University of California at Berkeley (where he was advised by Dan Klein) and his B.Tech. from the Indian Institute of Technology at Kanpur. His research expertise is in multimodal generative models, reasoning and planning agents, faithful language generation, and interpretable, efficient, and generalizable deep learning. He is an ACL and AAAI Fellow and recipient of the Presidential Early Career Award for Scientists and Engineers (PECASE), IIT Kanpur Young Alumnus Award, DARPA Director's Fellowship, NSF CAREER Award, Google Focused Research Award, Microsoft Investigator Fellowship, Army Young Investigator Award (YIP), DARPA Young Faculty Award (YFA), and outstanding paper awards at ACL, CVPR, EACL, COLING, CoNLL, and TMLR. He has been a keynote speaker for the IEEE/CVF WACV 2027, IEEE MLSP 2026, ECAI 2025, ACM-CODS 2025, AACL-IJCNLP 2023, CoNLL 2023, and INLG 2022 conferences. His service includes EMNLP Program Co-Chair, Associate Editor-in-Chief for TPAMI, CoNLL Program Co-Chair, ACL Executive Committee, ACM Doctoral Dissertation Award Committee, ACL Doctoral Dissertation Award Co-Organizer, ACL Mentorship Program Co-Founder, and Associate Editor for TACL, CL, IEEE/ACM TASLP, and CSL journals.
11:30 AM - 12:00 PM MT	Panel Discussion + Closing Remarks	Moderator Jianfeng Gao Bio: TODO

Call for Papers

We welcome original contributions that advance the state of the art in vision-language learning, multimodal perception, and embodied AI, particularly in unconstrained, real-world environments. Topics of interest include, but are not limited to:

LMMs & Vision-Language Systems: Open-vocabulary learning, multimodal pretraining, and adaptation.
Video Understanding & Temporal Reasoning: Long-range video modeling, causal reasoning, and instruction-following.
VLA & Embodied AI: Multimodal action learning, simulation-to-real transfer, and robotic perception.
Foundation Models for Vision Tasks: Object detection, segmentation, tracking, and fine-grained recognition in the wild.
Efficient Training Methods: Large visual model adaptation methods, measured by #training samples (zero-shot and few-shot), #trainable parameters, throughput, and training cost.
New Metrics and Benchmarks: Novel ways to evaluate existing LMMs and large vision models for task-level transfer and open-set visual recognition.

We accept abstract submissions to our workshop. All submissions must be at most 8 pages (excluding references) and follow the CVPR 2025 author guidelines. All submissions will be reviewed by the Program Committee on the basis of technical quality, relevance to the scope of the conference, originality, significance, and clarity. The review process is double-blind, and accepted papers are NOT archived in the CVPR 2025 Proceedings.

Workshop Paper Submission Portal: [Open Review]
Submission Deadline: April 15th, 2026
Acceptance Notification: May 10th, 2026
Camera-ready Submission: May 30th, 2026

For more information about paper submission, please reach out to the workshop organizers.

Call for Challenge Submissions

We introduce two new challenges to evaluate the performance of large vision models in the wild:

Challenge	Task	Eval Metrics	Instructions	Make a Challenge Submission
MindCube	Spatial Mental Model Reasoning	Accuracy		Leaderboard
SITE-Bench	Spatial Intelligence Thorough Evaluation	Chance-Adjusted Accuracy		Leaderboard
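
For participants unfamiliar with the Chance-Adjusted Accuracy metric listed for SITE-Bench, the sketch below shows one common way such a metric is computed for multiple-choice questions: raw correct counts are corrected by the score expected from random guessing, so that 0 corresponds to chance level and 1 to a perfect score. This is an illustration under stated assumptions, not the official scoring code. The function name `chance_adjusted_accuracy` and its inputs are hypothetical, and the benchmark's own definition on its leaderboard page takes precedence.

```python
from typing import Sequence


def chance_adjusted_accuracy(correct: Sequence[bool],
                             num_choices: Sequence[int]) -> float:
    """Illustrative chance-adjusted accuracy for multiple-choice questions.

    Assumed formula (not taken from the workshop page):
        (hits - expected_hits_by_chance) / (N - expected_hits_by_chance)
    where a question with k answer options contributes 1/k expected hits.
    """
    if len(correct) != len(num_choices):
        raise ValueError("correct and num_choices must have the same length")
    if not correct:
        raise ValueError("at least one question is required")
    if any(k < 1 for k in num_choices):
        raise ValueError("each question needs at least one answer option")

    n = len(correct)
    hits = sum(1 for c in correct if c)
    # Expected number of correct answers under uniform random guessing.
    expected = sum(1.0 / k for k in num_choices)
    denom = n - expected
    if denom == 0:
        raise ValueError("metric undefined when every question has one option")
    return (hits - expected) / denom


if __name__ == "__main__":
    # Two of four 4-way questions answered correctly.
    score = chance_adjusted_accuracy([True, True, False, False], [4, 4, 4, 4])
    print(f"chance-adjusted accuracy: {score:.3f}")
```

Under this definition a model that guesses at random scores around 0 regardless of how many options each question has, which makes results comparable across question types with different numbers of choices.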