NeurIPS D&B Track 2025
Authors:
Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, Chuang Gan
This is the official repository of TalkCuts, a large-scale dataset for multi-shot human speech video generation.
- 2025-12-14: Dataset and processing code are released.
- 2025-10-08: Paper is on arXiv.
- 2025-09-18: TalkCuts is accepted to NeurIPS 2025!
Please fill in this form to request dataset access and obtain the download link:
After approval, you will receive a dataset link and a CSV file describing the video list.
Use the provided CSV file to download the videos:
python download_videos.py {csv_file_path} {target_folder_path}

- {csv_file_path}: the CSV file you received / prepared after access is granted
- {target_folder_path}: directory to save the downloaded videos
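As a rough sketch of what the download step consumes, the snippet below parses a video-list CSV into (id, url) pairs. The column names `video_id` and `url` are assumptions for illustration; check the header of the CSV you actually receive, since `download_videos.py` defines the real format.

```python
import csv
import io

def parse_video_list(csv_text):
    """Parse a video-list CSV into (video_id, url) pairs.

    The "video_id" and "url" column names are illustrative assumptions;
    the CSV provided after approval may use different headers.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["video_id"], row["url"]) for row in reader]

# Example with a made-up two-row CSV:
sample = (
    "video_id,url\n"
    "abc123,https://example.com/a.mp4\n"
    "def456,https://example.com/b.mp4\n"
)
print(parse_video_list(sample))
```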
This pipeline processes raw videos to generate high-quality segmented short clips and paired human poses for video generation training.
It involves scene detection, body detection, filtering, pose estimation, and visualization.
Raw Videos
|
| (0_scene_det.py)
v
Split Videos ---------------------+--------------------------+
| | |
| (1_body_det.py) | |
v | |
Body Detection Results | |
| | |
| (2_body_filter.py) | |
v | |
Filtered Det Results | |
| | |
+-----> (3_pose_det.py) <-----+ |
| |
v |
Pose Estimation Results |
| |
| (4_pose_filter.py) |
v |
Classified Pose Data |
| |
+-----> (5_draw_pose.py) <-------------------+
|
v
Visualization Videos
Function: Detects scene changes in videos using PySceneDetect and splits them into clips.
- Input: Folder with video files
- Output: Scene start frame indices (pickle) and split video clips
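Since the pickle stores scene start-frame indices, each clip's frame range can be recovered by pairing consecutive start indices (the exact pickle layout in `0_scene_det.py` may differ; this assumes a sorted list of start frames per video):

```python
def scenes_to_ranges(start_frames, total_frames):
    """Convert sorted scene start-frame indices into half-open
    (start, end) frame ranges, one per clip."""
    bounds = list(start_frames) + [total_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

# Three scene starts in a 900-frame video yield three clips:
print(scenes_to_ranges([0, 120, 450], 900))  # → [(0, 120), (120, 450), (450, 900)]
```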
Env Installation:
pip install scenedetect[opencv] tqdm
# Install FFmpeg (Mac: brew install ffmpeg / Linux: sudo apt-get install ffmpeg)

Run:
# Edit video_folder and output_folder in script
python 0_scene_det.py 0 1

Function: Detects human bodies in videos using RTMDet.
Env Installation (Linux):
- Install PyTorch (see https://pytorch.org).
- Install MIM & MMLab:
pip install -U openmim
mim install mmengine "mmcv>=2.0.0" "mmdet>=3.0.0" "mmpose>=1.0.0"
- Clone config:
git clone https://github.com/open-mmlab/mmpose.git
Weight Download:
wget https://download.openmmlab.com/mmpose/v1/projects/rtmposev1/rtmdet_m_8xb32-100e_coco-obj365-person-235e8209.pth

Function: Filters videos to keep only those in which more than 80% of frames contain exactly one person (bounding box larger than 192 px).
- Input: Detection pickles from Step 1
- Output: Filtered pickle files
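The stated filtering criterion can be sketched as a per-video check over the detection results. The input layout below (a list of per-frame `(x1, y1, x2, y2)` person boxes) is an assumption for illustration; `2_body_filter.py` defines the real pickle format, and whether "bbox > 192px" applies to one side or both is a guess here:

```python
def keep_video(per_frame_boxes, min_ratio=0.8, min_size=192):
    """Return True if more than `min_ratio` of frames contain exactly one
    person whose box exceeds `min_size` px on both sides.

    `per_frame_boxes`: one entry per frame, each a list of
    (x1, y1, x2, y2) person boxes. Illustrative layout only.
    """
    good = 0
    for boxes in per_frame_boxes:
        if len(boxes) == 1:
            x1, y1, x2, y2 = boxes[0]
            if (x2 - x1) > min_size and (y2 - y1) > min_size:
                good += 1
    return good > min_ratio * len(per_frame_boxes)
```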
Run:
# Edit in_path and out_path in script
python 2_body_filter.py

Function: Extracts whole-body pose keypoints with DWPose for the filtered videos.
Env Installation:
- Clone DWPose:
git clone https://github.com/IDEA-Research/DWPose.git
- Dependencies: same as Step 1 (mmpose, mmdet, mmcv).
Weight Download:
wget "https://huggingface.co/yzd-v/DWPose/resolve/main/dw-ll_ucoco_384.pth?download=true" -O dw-ll_ucoco_384.pth

Run:
# Edit paths (pickle_path, mp4_base_path, output_path, config)
python 3_pose_det.py 0 100

Function: Classifies videos by pose quality (keypoint scores) into whole_body, half_body, head_body, and low_quality.
- Input: Pose pickles from Step 3
- Output: Classification text files and cleaned pose pickles
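The classification can be sketched as region-wise visibility checks on one frame's keypoint scores. The snippet assumes COCO-WholeBody ordering (body keypoints 0-16, with 0-4 covering the head and 11-16 the hips and legs); the 0.3 score threshold and the per-region vote counts are illustrative guesses, not the rules actually used in `4_pose_filter.py`:

```python
def classify_pose(kpt_scores, thresh=0.3):
    """Classify one frame's whole-body keypoint scores into a coarse
    visibility category (illustrative thresholds; assumes
    COCO-WholeBody keypoint order)."""
    head_ok = sum(s > thresh for s in kpt_scores[0:5]) >= 3    # nose, eyes, ears
    upper_ok = sum(s > thresh for s in kpt_scores[5:11]) >= 4  # shoulders..wrists
    lower_ok = sum(s > thresh for s in kpt_scores[11:17]) >= 4 # hips..ankles
    if head_ok and upper_ok and lower_ok:
        return "whole_body"
    if head_ok and upper_ok:
        return "half_body"
    if head_ok:
        return "head_body"
    return "low_quality"
```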
Run:
# Edit path1, output_file, out_pose
python 4_pose_filter.py

Function: Visualizes pose keypoints by overlaying them on the original video frames.
- Input: Original/Split videos and (cleaned) pose pickles
- Output: Visualization videos
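A minimal per-frame overlay with Pillow might look like the sketch below; it draws only confident keypoints as dots, whereas `5_draw_pose.py` presumably also draws limb connections and writes the result back to video (via av). The `(x, y, score)` keypoint tuple format is an assumption:

```python
from PIL import Image, ImageDraw

def draw_keypoints(frame, keypoints, radius=3, min_score=0.3):
    """Overlay pose keypoints on one frame.

    `frame` is a PIL.Image; `keypoints` is a list of (x, y, score)
    tuples (assumed format). Points below `min_score` are skipped.
    """
    draw = ImageDraw.Draw(frame)
    for x, y, score in keypoints:
        if score < min_score:
            continue
        draw.ellipse([x - radius, y - radius, x + radius, y + radius],
                     fill=(0, 255, 0))
    return frame
```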
Env Installation:
pip install pillow av

Run:
# Edit paths (video_root, pose_root, save_root)
python 5_draw_pose.py

- The TalkCuts dataset is provided for research and non-commercial use only.
- By requesting access and/or using the dataset, you agree to:
- Use it only for research purposes (no commercial usage).
- Not redistribute the dataset (or any download links/credentials) to third parties.
- Follow applicable laws and ethical guidelines, including privacy and consent requirements.
- For commercial licensing or additional permissions, please contact the authors.
Note: This repository may include code under its own open-source license, while the dataset itself follows the above research / non-commercial terms.
If you find this dataset useful in your research, please cite:
@article{chen2025talkcuts,
title={TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation},
author={Chen, Jiaben and Wang, Zixin and Zeng, Ailing and Fu, Yang and Yu, Xueyang and Cen, Siyuan and Tanke, Julian and Chen, Yihang and Saito, Koichi and Mitsufuji, Yuki and others},
journal={arXiv preprint arXiv:2510.07249},
year={2025}
}