ReLearn @ CVPR 2026

Overview

As AI capabilities surge, ReLearn asks whether machines can still learn from humans and how cognitive science can shape the next generation of systems.

We bridge computer vision, cognitive science, and psychology to study reasoning, social understanding, and hybrid learning that blends human insight with autonomous discovery.

Key Themes

Human-inspired foundations of reasoning and Theory of Mind
Learning with humans via feedback and interaction
Beyond the human blueprint through self-supervision

Important Dates

Feb 20, 2026: CVPR final decisions to authors

Mar 18, 2026: Workshop paper submission deadline

Mar 25, 2026: Workshop notification to authors

Apr 10, 2026: Workshop camera-ready deadline

Jun 3, 2026: Workshop date

Invited Speakers

Alexei (Alyosha) Efros

UC Berkeley

Alexei (Alyosha) Efros is a Professor in the Department of Electrical Engineering and Computer Sciences (EECS) at UC Berkeley. Prior to that, he was on the faculty of Carnegie Mellon University. His research is in the area of computer vision and computer graphics, especially at the intersection of the two. He is particularly interested in using data-driven techniques to tackle problems where large quantities of unlabeled visual data are readily available. He is a recipient of the CVPR Best Paper Award (2006), Sloan Fellowship (2008), Guggenheim Fellowship (2008), Okawa Grant (2008), SIGGRAPH Significant New Researcher Award (2010), three PAMI Helmholtz Test-of-Time Prizes (1999, 2003, 2005), the ACM Prize in Computing (2016), Diane McEntyre Award for Excellence in Teaching Computer Science (2019), Jim and Donna Gray Award for Excellence in Undergraduate Teaching of Computer Science (2023), and the PAMI Thomas S. Huang Memorial Prize (2023).

Dima Damen

University of Bristol / Google DeepMind

Dima Damen is a Professor of Computer Vision at the University of Bristol and a Senior Research Scientist at Google DeepMind. She is currently an EPSRC Fellow (2020-2026), focusing her research on the automatic understanding of object interactions, actions, and activities using wearable visual (and depth) sensors. She is best known for her leading work in Egocentric Vision, and has also contributed to novel research questions including mono-to-3D, video object segmentation, assessing action completion, domain adaptation, skill and expertise determination from video sequences, discovering task-relevant objects, dual-domain and dual-time learning, as well as multi-modal fusion using vision, audio, and language.

Saining Xie

NYU Courant

Saining Xie is an Assistant Professor of Computer Science at NYU Courant and part of the CILVR group. He is also affiliated with the NYU Center for Data Science. Before that, he was a research scientist at Facebook AI Research (FAIR), Menlo Park. He received his Ph.D. and M.S. degrees from the CSE Department at UC San Diego, advised by Zhuowen Tu. During his Ph.D. studies, he also interned at NEC Labs, Adobe, Facebook, Google, and DeepMind. Prior to that, he obtained his bachelor's degree from Shanghai Jiao Tong University. His primary research interests are computer vision and machine learning.

Manling Li

Northwestern University

Title: How Foundation Models Build (and Fail to Build) Spatial Minds: A Piagetian View

Abstract: Spatial cognition is a developmental capacity that, in humans, unfolds in stages. This talk asks how far foundation models have traveled along the same path. Following Piaget's account of spatial development, I will trace three layers: topological reasoning over invariants that survive deformation (MindTopo), projective reasoning that constructs beliefs about unseen structure through active exploration (Theory of Space), and metric reasoning that maintains a coherent mental map from only a few views (MindCube). Read through this lens, today's models show a consistent dissociation: they name spatial structure in a static scene yet fail to preserve or act on it once the world moves (ENACT). I will suggest that the missing ingredient is not sharper perception, but a structured and updatable model of space.

Manling Li is an Assistant Professor at Northwestern University and an Amazon Scholar. She was a postdoc at Stanford University, and obtained her PhD in Computer Science at the University of Illinois Urbana-Champaign in 2023. She works on Reasoning, Planning and Compositionality, at the intersection of Language, Vision, and Robotics. Her work has been recognized with the ACL 2025 Dissertation Award Honorable Mention, Outstanding Paper Award at ACL’24, Best Demo Paper Award at NAACL’21 and ACL’20, Best Paper Awards at NeurIPS/ICCV/RSS workshops, MIT Tech Review 35 Innovators Under 35, Microsoft Research PhD Fellowship, EECS Rising Star, etc. She led the tutorials, workshops, and challenges on Foundation Models meet Embodied Agents. Additional information is available at limanling.github.io.

Alan Yuille

Johns Hopkins University

Alan Yuille received the BA degree in mathematics from the University of Cambridge in 1976. His PhD in theoretical physics, supervised by Prof. S.W. Hawking, was approved in 1981. He was a research scientist in the Artificial Intelligence Laboratory at MIT and the Division of Applied Sciences at Harvard University from 1982 to 1988. He served as an assistant and associate professor at Harvard until 1996. He was a senior research scientist at the Smith-Kettlewell Eye Research Institute from 1996 to 2002. He was then a full professor of Statistics at the University of California, Los Angeles, with joint appointments in computer science, psychiatry, and psychology. He moved to Johns Hopkins University in January 2016. His research interests include computational models of vision, mathematical models of cognition, medical image analysis, and artificial intelligence and neural networks.

William T. Freeman

MIT CSAIL / Google Research

Joint presentation with Eric Li, senior graduate student.

William T. Freeman is the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science (EECS) at MIT, and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) there. He was the Associate Department Head of EECS from 2011 to 2014. Since 2015, he has also been a research manager at Google Research in Cambridge, MA.

His current research interests include mid-level vision and computational photography. Previous research topics include steerable filters and pyramids, orientation histograms, the generic viewpoint assumption, color constancy, computer vision for computer games, motion magnification, and belief propagation in networks with loops. He received outstanding paper awards at computer vision or machine learning conferences in 1997, 2006, 2009, 2012 and 2019, and test-of-time awards for papers from 1990, 1995, 2002, 2005, and 2012. He shared the 2020 Breakthrough Prize in Physics for a consulting role with the Event Horizon Telescope collaboration, which reconstructed the first image of a black hole. He is a member of the National Academy of Engineering, and a Fellow of the IEEE, ACM, and AAAI. In 2019, he received the PAMI Distinguished Researcher Award, the highest award in computer vision. He is co-author of the computer vision textbook, https://visionbook.mit.edu/, also available through MIT Press.

Presented Papers

Multi-Modal Manipulation via Multi-Modal Policy Consensus

Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, and Katherine Driggs-Campbell

G-RoLA: A Generative World Model Paradigm for Robotic Skill Acquisition from Any Image

Chenkai Gao and Yina Jian

UniVerse: Empower Unified Generation with Reasoning and Knowledge

Kaiyue Sun, Weiyang Jin, Chengqi Duan, Rongyao Fang, Xian Liu, Yuwei Niu, Chunwei Wang, Aoxue Li, and Xihui Liu

MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents

Ruoxuan Zhang, Qiyun Zheng, Zhiyu Zhou, Ziqi Liao, Siyu Wu, Jian-Yu Jiang-Lin, Bin Wen, Hongxia Xie, Jianlong Fu, and Wen-Huang Cheng

Spot The Ball: A Benchmark for Visual Social Inference

Neha Balamurugan, Sarah Wu, Cristobal Eyzaguirre, and Tobias Gerstenberg

Time Blindness: Why Video-Language Models Can't See What Humans Can?

Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, and Mohamed Elhoseiny

Learning to See Through a Baby's Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines

Yusen Cai, Qing Lin, Bhargava Satya Nunna, and Mengmi Zhang

BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, and Boqing Gong

Schedule

Wednesday, June 3, 2026 · Mile High 4AB · PM slot (afternoon)

13:20 Welcome & introduction

13:30 Keynote 1 — Alexei (Alyosha) Efros

14:00 Keynote 2 — Manling Li

14:30 Keynote 3 — Dima Damen

15:00 Presentations of challenge winners

15:10 Posters & coffee break

16:00 Keynote 4 — Alan Yuille

16:30 Keynote 5 — Saining Xie (remote)

17:00 Keynote 6 — William T. Freeman, with Eric Li

17:30 Closing remarks

Call for Papers

We invite papers aligned with the workshop themes, spanning human-inspired foundations, learning with humans, and hybrid intelligence.

Submissions will follow CVPR 2026 formatting and length guidelines. Accepted papers will be presented in oral/spotlight and poster formats.

Submission Guidelines

We invite submissions of at most 8 pages, excluding references, prepared with the CVPR template and following the CVPR 2026 instructions. All papers will undergo double-blind review, so authors must not identify themselves in the submitted paper. The review process is single-stage, with no rebuttals.

Online Submission System: OpenReview
Submission Format: CVPR template (double column; no more than 8 pages, excluding references). Submissions are anonymous and must not include author names, affiliations, or contact information in the PDF.

If you have any questions, feel free to reach out to us.

Challenge

Multimodal Theory of Mind (ToM) Challenge: infer goals and beliefs from videos, textual scene descriptions, and dialogues.

The challenge includes two tracks: single-agent reasoning and multi-agent reasoning.

Challenge website: relearnchallenge.onrender.com.

Use this ReLearn workshop site for workshop updates and paper submissions, and the challenge site for team registration, benchmark resources, and Track 1/Track 2 submissions.

Track 1

Reasoning from a single agent's behavior.

Track 2

Reasoning from multi-agent interactions.

Timeline

Opens Feb 23, 2026 · Submissions due May 3, 2026.