Corresponding author: Yichi Zhang
Paper | Project Page | Dataset | Model
- Clone the repo

```bash
git clone https://github.com/pro-assist/ProAssist.git
cd ProAssist
```

- (Optional) Create a virtual environment

```bash
conda create -n mm python=3.10 -y
conda activate mm
```

- Install dependencies

```bash
pip install -r requirements.txt
pip install -e .
```

- Set the data root dir in `mmassist/configs/arguments.py`, or export `DATA_ROOT_DIR` in your environment.

```bash
export DATA_ROOT_DIR=<your_data_root_dir>
```
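To make the setting persist across shells, you can append it to your shell profile (the path below is a placeholder; use your actual data directory):

```bash
# Persist DATA_ROOT_DIR across sessions (example path, adjust to your setup)
echo 'export DATA_ROOT_DIR=/path/to/your/data_root' >> ~/.bashrc
source ~/.bashrc
```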
- Download the preprocessed data:
```bash
git lfs install
git clone https://huggingface.co/594zyc/ProAssist-Dataset
mv ProAssist-Dataset/processed_data $DATA_ROOT_DIR/processed_data
```
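As an alternative to `git clone`, the Hugging Face CLI can download the same repo and is often faster for large repos. This is a sketch assuming a recent `huggingface_hub`; the dataset is hosted as a model-type repo, so no `--repo-type dataset` flag is needed:

```bash
# Alternative download via the Hugging Face CLI (assumes huggingface_hub >= 0.19)
pip install -U huggingface_hub
huggingface-cli download 594zyc/ProAssist-Dataset --local-dir ProAssist-Dataset
mv ProAssist-Dataset/processed_data $DATA_ROOT_DIR/processed_data
```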
Note: the preprocessed data is 152 GB and contains many files, so the full download is slow. To download a subset of the data for preview, use the following commands:
```bash
git lfs install
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/594zyc/ProAssist-Dataset
cd ProAssist-Dataset
git lfs pull -I "processed_data/wtag"  # will only download the wtag subset
```
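The same pattern works for any other subset; the names below are taken from the dataset layout used elsewhere in this README:

```bash
# Fetch additional subsets on demand, e.g. holoassist
git lfs pull -I "processed_data/holoassist"
# Check how much data has been materialized so far
du -sh processed_data/*
```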
- Unzip the data:
```bash
for dataset in ego4d holoassist epickitchens egoexolearn wtag assembly101; do
    cd $DATA_ROOT_DIR/processed_data/$dataset
    unzip generated_dialogs.zip
    unzip prepared.zip
done
```
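A quick sanity check that the archives extracted where expected (a minimal sketch; the extracted directory names are assumptions based on the zip file names):

```bash
# Each dataset dir should now contain the unzipped contents alongside the archives
for dataset in ego4d holoassist epickitchens egoexolearn wtag assembly101; do
    ls $DATA_ROOT_DIR/processed_data/$dataset
done
```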
If you want to prepare the data from scratch using the LLM-based data generation pipeline, please see here.
- Download the pretrained model checkpoints:

```bash
cd $DATA_ROOT_DIR
mkdir -p models && cd models
# download the I=1 model (1 token per frame)
git clone https://huggingface.co/594zyc/ProAssist-Model-L4096-I1
# download the I=5 model (5 tokens per frame)
git clone https://huggingface.co/594zyc/ProAssist-Model-L4096-I5
# download the I=10 model (10 tokens per frame)
git clone https://huggingface.co/594zyc/ProAssist-Model-L4096-I10
```
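Model weights are stored as Git LFS objects, so it may be worth confirming they were fully fetched; this check is not part of the original instructions:

```bash
# Each checkpoint directory should be several GB once LFS objects are resolved;
# sizes in the KB range suggest only LFS pointer files were downloaded
du -sh ProAssist-Model-L4096-I*
```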
We provide several notebooks to demonstrate:
- Video and dialogue visualization (link)
- Model inference for streaming video-to-dialogue generation (link)
- LLM-based dialogue generation pipeline (link)
- LLM-as-a-judge evaluation (link)
- Dataset statistics overview (link)
Note: the training and evaluation scripts currently only work on a Slurm cluster.
```bash
# Train the I=1, 5, and 10 models (I = number of tokens per frame)
sbatch scripts/train/I1_8n_4096_1s.sh
sbatch scripts/train/I5_12n_4096_1s.sh
sbatch scripts/train/I10_16n_4096_1s.sh

# Evaluate a trained model
sbatch scripts/eval/Aug_eval_stream.sh
```
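A small convenience sketch for submitting and following a job. It uses standard Slurm flags; the `slurm-<jobid>.out` name is Slurm's default and assumes the provided scripts do not override `--output`:

```bash
# Submit a training job and capture its job id (--parsable prints just the id)
job_id=$(sbatch --parsable scripts/train/I1_8n_4096_1s.sh)
# Check queue status and follow the log
squeue -j "$job_id"
tail -f "slurm-${job_id}.out"
```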
Please consider citing our paper if you find this project helpful for your research:
```bibtex
@article{zhang2025proactive,
  title={Proactive Assistant Dialogue Generation from Streaming Egocentric Videos},
  author={Zhang, Yichi and Dong, Xin Luna and Lin, Zhaojiang and Madotto, Andrea and Kumar, Anuj and Damavandi, Babak and Chai, Joyce and Moon, Seungwhan},
  journal={arXiv preprint arXiv:2506.05904},
  year={2025}
}
```