This repository provides the data, tools, and code to download, explore, and utilize the TrackVerse dataset.
The TrackVerse dataset is a large-scale collection of 31.9 million object tracks, each capturing the motion and appearance of an object over time. These tracks are automatically extracted from YouTube videos using state-of-the-art object detection (Detic) and tracking (ByteTrack) algorithms. The dataset spans 1203 object categories from the LVIS ontology, ensuring a diverse and long-tailed distribution of object classes.
TrackVerse is designed to ensure object-centricity, class diversity, and rich object motions and states. Each track is enriched with metadata, including bounding boxes, timestamps, and prediction labels, making it a valuable resource for research in object-centric representation learning, video analysis, and robotics.
In our paper, we explore the use of TrackVerse for learning unsupervised image representations. By introducing natural temporal augmentations, i.e., viewing an object across time and motion, TrackVerse enables models to learn fine-grained, state-aware representations that are more sensitive to object transformations and behaviors (see the paper for details).
🎁 Bonus: Our fully automated object track collection pipeline can be easily scaled up without any manual annotation. You can also create your own customized dataset of object tracks using different vocabularies, source videos, or curation strategies.
- [Oct 2025] Our fully automated object track collection pipeline is now publicly released!
- [July 2025] TrackVerse dataset and download scripts are now publicly released!
- [June 2025] 🎉 Our paper TrackVerse has been accepted to ICCV 2025 🌺
Stay tuned for future updates and improvements!
TrackVerse is released as a collection of object track metadata stored in JSONL files, where each line represents a single track with the following fields:
Metadata keys:
- `track_id`: Unique ID for the track
- `track_ts`: Start and end timestamps of the track (seconds) in the original video
- `frame_ts`: Timestamps for each frame in the track (seconds) in the original video
- `frame_bboxes`: Bounding boxes `[x, y, width, height]` for each frame
- `yid`: YouTube video ID
- `track_mp4_filename`: Local filename of the track video
- `top10_label_ids`: Top-10 predicted class IDs
- `top10_label_names`: Top-10 predicted class names
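As a quick illustration, the snippet below reads tracks from one of these JSONL files. The filename is a placeholder, and the field accesses follow the keys listed above:

```python
import json

# Placeholder path; substitute the JSONL file for the subset you downloaded.
jsonl_path = "trackverse_subset.jsonl"

with open(jsonl_path, "r") as f:
    for line in f:
        track = json.loads(line)
        start_s, end_s = track["track_ts"]     # track span in the source video (seconds)
        x, y, w, h = track["frame_bboxes"][0]  # bounding box of the first frame
        print(track["track_id"], track["yid"], track["top10_label_names"][0])
```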
To support diverse research needs, we provide the full TrackVerse dataset, curated subsets at various scales to ensure more balanced class distributions, and a human-verified validation set for in-domain evaluation:
| Subset | #Tracks | Max Tracks per Class | Link |
|---|---|---|---|
| Full TrackVerse | 31.9M | --- | Coming soon. |
| 82K-CB100 | 82K | 100 | 🤗 Link |
| 184K-CB300 | 184K | 300 | 🤗 Link |
| 259K-CB500 | 259K | 500 | 🤗 Link |
| 392K-CB1000 | 392K | 1000 | 🤗 Link |
| 1121K-CB2500 | 1.1M | 2500 | 🤗 Link |
| 3778K-CB8000 | 3.8M | 8000 | 🤗 Link |
| Validation Set | 4188 | 6 | Link |
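As a minimal sketch, assuming each subset is published as a JSONL file on the Hugging Face Hub, the metadata can be fetched with `huggingface_hub`. The repo ID and filename below are hypothetical placeholders; use the actual values from the links above:

```python
from huggingface_hub import hf_hub_download

# Hypothetical repo ID and filename; replace with the values from the table above.
jsonl_path = hf_hub_download(
    repo_id="MMPLab/TrackVerse",  # placeholder dataset repo
    filename="82K-CB100.jsonl",   # placeholder subset file
    repo_type="dataset",
)
print("Downloaded metadata to", jsonl_path)
```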
For detailed instructions on extracting TrackVerse from the JSONL files, refer to the download guide.
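The download guide is the authoritative reference; as a rough sketch of what extraction involves, the snippet below fetches a source video with yt-dlp and cuts out one track's time span with ffmpeg. Both tools are assumed to be installed and on your PATH, and the helper name is hypothetical:

```python
import subprocess

def extract_track_clip(track: dict, out_dir: str = "tracks") -> str:
    """Hypothetical helper: fetch the source video and cut out one track's span."""
    src = f"{track['yid']}.mp4"
    # Download the full source video (assumes yt-dlp is installed).
    subprocess.run(
        ["yt-dlp", "-f", "mp4", "-o", src,
         f"https://www.youtube.com/watch?v={track['yid']}"],
        check=True,
    )
    start_s, end_s = track["track_ts"]
    out_path = f"{out_dir}/{track['track_mp4_filename']}"
    # Re-encode while cutting so the clip starts exactly at start_s.
    subprocess.run(
        ["ffmpeg", "-i", src, "-ss", str(start_s), "-to", str(end_s), out_path],
        check=True,
    )
    return out_path
```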
You can also create your own customized dataset of object tracks, for example by using a different vocabulary, different source videos, or a different curation strategy.
- Set Up the Environment: Refer to the install guidelines for detailed instructions.
- Clone the Repository:
  ```bash
  git clone --recurse-submodules https://github.com/MMPLab/TrackVerse.git
  ```
- Follow the Pipeline: Follow the detailed steps outlined in our pipeline documentation.
For support or inquiries, please open a GitHub issue. If you have questions about technical details or need further assistance, feel free to reach out to us directly.
All code and data in this repo are available under the MIT License for research purposes only.
Please consider giving a star ⭐ and citing our paper if you find this repo useful:
```bibtex
@InProceedings{Wei_2025_ICCV,
    author    = {Wei, Yibing and Church, Samuel and Suciu, Victor and Lin, Jinhong and Wu, Cheng-En and Morgado, Pedro},
    title     = {TrackVerse: A Large-Scale Object-Centric Video Dataset for Image-Level Representation Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {11153-11163}
}
```