Egocentric Vision · Benchmark · 2026

CoMind
A Dual-View Egocentric Dataset for Interaction Understanding

Every session is captured simultaneously by two synchronized head-mounted cameras and paired with two exocentric views, dense scene scans, and high-resolution object scans, enabling a new class of multi-view egocentric tasks.

CoMind modalities — paired egocentric and exocentric video, scene scans, and object scans
Headline statistics: Sessions · Ego Video Hours · Scanned Kitchens · Scanned Objects · Benchmark Tasks
01 — Video

Overview Video

A short walkthrough of CoMind: dual-view egocentric capture, scene and object scans, and the three benchmark tasks.

02 — Signature Feature

Two Egocentric Views, One Moment

Every session is recorded by two participants wearing synchronized capture glasses. Both streams are frame-aligned, hardware-timestamped, and come with shared scene & object scans.

Ego View — Subject A
Ego View — Subject B
03 — 3D Assets

Scene & Object Scans

Beyond video, every session ships with the 3D context it was captured in.

Scene Scan

Dense Scene Reconstructions

Photogrammetric scans of each recording environment, delivered as point clouds and aligned to ego-camera coordinates for direct reprojection.

  • Colored .ply
  • Camera extrinsics per frame
  • 53 unique environments
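
To illustrate what direct reprojection could look like, the sketch below projects scan points into a single ego frame with a pinhole camera model. It is a minimal sketch under stated assumptions: the extrinsic is taken to be a 4x4 world-to-camera matrix, `K` a 3x3 pinhole intrinsic matrix, and the camera is assumed to look along +Z. The dataset's actual file layout, matrix conventions, and lens model are not specified here, so `project_points` and all of its argument names are hypothetical.

```python
import numpy as np


def project_points(points_world, world_to_cam, K, image_size):
    """Project Nx3 world points into the pixel grid of one frame.

    Assumptions (not confirmed by the dataset): world_to_cam is a 4x4
    world-to-camera transform, K is a 3x3 pinhole intrinsic matrix, and the
    camera looks along +Z.

    Returns (pixels, mask): the Mx2 pixel coordinates of the visible points
    and a boolean mask of length N marking which input points were kept.
    """
    points_world = np.asarray(points_world, dtype=float)
    # Homogeneous coordinates, then move the points into the camera frame.
    homo = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (np.asarray(world_to_cam, dtype=float) @ homo.T).T[:, :3]

    # Only points in front of the camera can be seen.
    in_front = cam[:, 2] > 1e-6
    # Use a dummy depth of 1 for culled points to avoid dividing by zero.
    depth = np.where(in_front, cam[:, 2], 1.0)

    # Pinhole projection followed by the perspective divide.
    uv = (np.asarray(K, dtype=float) @ cam.T).T[:, :2] / depth[:, None]

    width, height = image_size
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    mask = in_front & inside
    return uv[mask], mask
```

The returned mask can be used to index the point colors of the scan, so that projected points can be drawn over the video frame as a quick alignment check.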
Object Scan

High-Resolution Object Scans

Interacted objects are independently scanned at sub-millimeter resolution, enabling 6-DoF pose supervision and rendering-based evaluation.

  • Watertight meshes
  • Scale-accurate, oriented upright
  • Objects reused across recordings
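
As a sketch of how a scanned mesh could support 6-DoF pose supervision, the snippet below places mesh vertices under a rigid pose (rotation `R`, translation `t`) and compares a predicted pose with a reference pose by the average distance between corresponding vertices. This average-distance error is one common choice in the object pose literature; the page does not state which pose metric or loss is used, so the function names and the choice of metric are assumptions.

```python
import numpy as np


def transform_vertices(vertices, R, t):
    """Apply a rigid 6-DoF pose (3x3 rotation R, 3-vector t) to Nx3 vertices."""
    vertices = np.asarray(vertices, dtype=float)
    return vertices @ np.asarray(R, dtype=float).T + np.asarray(t, dtype=float)


def average_distance_error(vertices, R_pred, t_pred, R_ref, t_ref):
    """Mean Euclidean distance between the mesh under two poses.

    Hypothetical metric for illustration; units follow the mesh units.
    """
    pred = transform_vertices(vertices, R_pred, t_pred)
    ref = transform_vertices(vertices, R_ref, t_ref)
    return float(np.linalg.norm(pred - ref, axis=1).mean())
```

Because the scans are described as scale-accurate, an error computed this way would be in physical units, which makes thresholds easier to interpret.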
04 — Benchmarks

Three Tasks Testing Social Reasoning

The tasks are built from realistic social scenarios and require complex reasoning about social cues in the provided speech transcripts and context frames.

T1

Joint Attention Estimation

Given two synchronized ego views at a moment of joint attention, the model predicts the category of the jointly attended object, its bounding box in the left and right views, and the type of social cue that can be used to infer which object is being attended to.

Multi-view · Grounding in Two Views · Object Category Prediction · Social Cue Detection
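
The T1 outputs described above can be pictured as a small record with one bounding box per view. The sketch below is illustrative only: the field names, the `(x_min, y_min, x_max, y_max)` box format, the IoU threshold of 0.5, and the idea of scoring grounding by IoU in both views are assumptions, since the page does not specify the annotation format or the evaluation protocol.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class JointAttentionPrediction:
    """Hypothetical container for one T1 prediction or annotation."""
    object_category: str
    box_left: Box    # box in the left ego view
    box_right: Box   # box in the right ego view
    cue_type: str    # label set is not given on the page


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def score_prediction(
    pred: JointAttentionPrediction,
    ref: JointAttentionPrediction,
    iou_threshold: float = 0.5,
) -> Dict[str, bool]:
    """Compare a prediction with a reference, one flag per sub-task."""
    return {
        "category": pred.object_category == ref.object_category,
        "cue": pred.cue_type == ref.cue_type,
        # Grounding counts only if the object is localized in both views.
        "grounded_both_views": (
            box_iou(pred.box_left, ref.box_left) >= iou_threshold
            and box_iou(pred.box_right, ref.box_right) >= iou_threshold
        ),
    }
```

Requiring a match in both views is a design choice of this sketch: it reflects the multi-view framing of the task, but a per-view score would be an equally plausible protocol.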
T2

Socially Conditioned Object Interaction Anticipation

From context frames and transcribed audio together with a prediction frame, the model predicts the noun and verb of the action to be performed, the interacted object's bounding box, and the type of social cue that can be used to infer the next interacted object.

Single-view · Grounding · Object Category Prediction · Action Prediction · Social Cue Detection
T3

Collaborative Handover Prediction

From context frames and transcribed audio together with a prediction frame showing a moment directly preceding a handover event, the model predicts the delivery flow (who hands to whom), the object category and bounding box in the view of the handing participant, the initiator, and the cue type.

Multi-view · Grounding · Object Category Prediction · Social Cue Detection · Handover Initiator Detection · Handover Flow Detection
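
One way to represent the T3 outputs is a record that keeps the delivery flow and the initiator as separate fields, since the task description lists them separately. The sketch below is a guess at such a structure: the class and field names, the `Subject` labels A and B (borrowed from the video captions), and the reading that the initiator may be either the giver or the receiver are assumptions, not the dataset's schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Subject(Enum):
    A = "A"
    B = "B"


@dataclass(frozen=True)
class HandoverPrediction:
    """Hypothetical container for one T3 prediction."""
    giver: Subject        # delivery flow: who hands the object over
    receiver: Subject     # delivery flow: who receives it
    initiator: Subject    # who started the handover
    object_category: str
    box_in_giver_view: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    cue_type: str         # label set is not given on the page

    def __post_init__(self) -> None:
        # A handover needs two distinct participants.
        if self.giver == self.receiver:
            raise ValueError("giver and receiver must be different participants")

    @property
    def receiver_initiated(self) -> bool:
        """True when the receiving participant started the handover."""
        return self.initiator == self.receiver
```

For example, `HandoverPrediction(Subject.A, Subject.B, Subject.B, "knife", (10, 20, 60, 90), "some_cue")` would describe Subject A handing a knife to Subject B after B initiated the exchange; the category and cue strings here are placeholders.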
05 — Access

Download

Coming soon. Please stay tuned!

06 — BibTeX

Citation

@article{comind2026,
  title     = {{CoMind: Understanding Collaborative Human Activity from Multiple Minds and Views}},
  author    = {Gavryushin, Alexey and Zhang, Dingxi and Huang, Zhao and Delitzas, Alexandros and Chen, Jiaqi and Ellis, Ben and Z{\"o}llner, Cedric and Patel, Manthan and Kaufmann, Manuel and Pollefeys, Marc and Wang, Xi},
  year      = {2026}
}