CoMind Dataset: Understanding Collaborative Human Activity from Multiple Minds and Views

01 — Video

Overview Video

A short walkthrough of CoMind: dual-view egocentric capture, scene and object scans, and the three benchmark tasks.

02 — Signature Feature

Two Egocentric Views, One Moment

Every session is recorded by two participants wearing synchronized capture glasses. Both streams are frame-aligned, hardware-timestamped, and come with shared scene & object scans.

Ego View — Subject A

Ego View — Subject B
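
As a sanity check on the synchronization described above, the two streams' timestamps can be compared frame by frame. The sketch below is only illustrative: it assumes each stream exposes one integer timestamp per frame in nanoseconds, and the function name `max_sync_offset_ns` is hypothetical, not part of any released CoMind tooling.

```python
from typing import Sequence


def max_sync_offset_ns(ts_a: Sequence[int], ts_b: Sequence[int]) -> int:
    """Return the largest absolute timestamp gap between paired frames.

    Assumes frame i of stream A corresponds to frame i of stream B
    (frame-aligned streams) and that timestamps are in nanoseconds.
    """
    if len(ts_a) != len(ts_b):
        raise ValueError("frame-aligned streams must have equal length")
    if not ts_a:
        return 0
    return max(abs(a - b) for a, b in zip(ts_a, ts_b))
```

A small returned value relative to the frame period would suggest the pairing by index is safe to rely on.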

03 — 3D Assets

Scene & Object Scans

Beyond video, every session ships with the 3D context it was captured in.

Scene Scan

Dense Scene Reconstructions

Photogrammetric scans of each recording environment, delivered as point clouds and aligned to ego-camera coordinates for direct reprojection.

Colored .ply
Camera extrinsics per frame
53 unique environments
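
To illustrate what reprojection with per-frame extrinsics could look like, here is a minimal sketch that projects one scene point into an ego-camera image. It assumes a pinhole camera model, a row-major 4x4 world-to-camera matrix, and intrinsics `fx`, `fy`, `cx`, `cy`; none of these conventions are confirmed by the dataset description, and `project_point` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Mat4 = List[List[float]]  # assumed: row-major 4x4 world-to-camera transform


def project_point(
    point_world: Vec3,
    world_to_cam: Mat4,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Optional[Tuple[float, float]]:
    """Project a 3D scene point to pixel coordinates (pinhole model).

    Returns None when the point lies on or behind the camera plane.
    """
    x, y, z = point_world
    # Rigid transform into the camera frame using the top three rows.
    xc, yc, zc = (
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in world_to_cam[:3]
    )
    if zc <= 0.0:
        return None
    # Perspective division followed by the intrinsic mapping.
    return (fx * xc / zc + cx, fy * yc / zc + cy)
```

Applying this to every point of a colored `.ply` cloud, frame by frame, would give a sparse rendering of the scene from each participant's viewpoint. Lens distortion is ignored here.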

Object Scan

High-Resolution Object Scans

Interacted objects are independently scanned at sub-millimeter resolution, enabling 6-DoF pose supervision and rendering-based evaluation.

Watertight meshes
Scale-accurate, oriented upright
Objects reused across recordings

04 — Benchmarks

Three Tasks Testing Social Reasoning

The tasks are built from realistic social scenarios and require complex reasoning about social cues in the provided speech transcripts and context frames.

T1

Joint Attention Estimation

Given two synchronized ego views at a moment of joint attention, the model predicts the category of the jointly attended object, its bounding box in the left and right views, and the type of social cue that can be used to infer which object is being attended to.

Multi-view
Grounding in Two Views
Object Category Prediction
Social Cue Detection
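
The four outputs of this task can be pictured as a single record. The sketch below is an assumption about how one might hold a prediction in code, not the benchmark's actual submission format: the class name, field names, and the `(x_min, y_min, x_max, y_max)` pixel box convention are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


def is_valid_bbox(box: BBox) -> bool:
    """A box is valid when it has positive width and height."""
    x_min, y_min, x_max, y_max = box
    return x_min < x_max and y_min < y_max


@dataclass(frozen=True)
class JointAttentionPrediction:
    object_category: str  # category of the jointly attended object
    bbox_left: BBox       # object location in the left ego view
    bbox_right: BBox      # object location in the right ego view
    social_cue: str       # cue type label; the label set is not given here

    def __post_init__(self) -> None:
        # Reject degenerate boxes early rather than at evaluation time.
        for box in (self.bbox_left, self.bbox_right):
            if not is_valid_bbox(box):
                raise ValueError(f"degenerate bounding box: {box}")
```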

T2

Socially Conditioned Object Interaction Anticipation

From context frames and transcribed audio together with a prediction frame, the model predicts the noun and verb of the action to be performed, the interacted object's bounding box, and the type of social cue that can be used to infer the next interacted object.

Single-view
Grounding
Object Category Prediction
Action Prediction
Social Cue Detection

T3

Collaborative Handover Prediction

From context frames and transcribed audio together with a prediction frame showing a moment directly preceding a handover event, the model predicts the delivery flow (who hands to whom), the object category and bounding box in the view of the handing participant, the initiator, and the cue type.

Multi-view
Grounding
Object Category Prediction
Social Cue Detection
Handover Initiator Detection
Handover Flow Detection
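
One way to make the handover outputs concrete is a small record that keeps the delivery flow and the initiator as separate fields, since the participant who starts a handover need not be the one holding the object. This is a sketch under assumptions: the names `Participant` and `HandoverPrediction`, the A/B labels, and the box convention are hypothetical and not taken from released CoMind code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


class Participant(Enum):
    A = "A"
    B = "B"


@dataclass(frozen=True)
class HandoverPrediction:
    giver: Participant      # delivery flow: who hands the object over
    receiver: Participant   # delivery flow: who receives it
    object_category: str
    bbox_giver_view: BBox   # box in the handing participant's view
    initiator: Participant  # who initiates the handover
    social_cue: str         # cue type label; the label set is not given here

    def __post_init__(self) -> None:
        # A handover needs two distinct participants.
        if self.giver == self.receiver:
            raise ValueError("giver and receiver must differ")
```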

05 — Access

Download

Coming soon. Please stay tuned!

06 — BibTeX

Citation

@article{comind2026,
  title     = {{CoMind: Understanding Collaborative Human Activity from Multiple Minds and Views}},
  author    = {Gavryushin, Alexey and Zhang, Dingxi and Huang, Zhao and Delitzas, Alexandros and Chen, Jiaqi and Ellis, Ben and Z{\"o}llner, Cedric and Patel, Manthan and Kaufmann, Manuel and Pollefeys, Marc and Wang, Xi},
  year      = {2026}
}