Seeing through the conversation: Audio-visual speech separation based on diffusion model (AVDiffuSS)
This repository contains the official PyTorch implementation for the paper *Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model*.
Our demo page is here.
- Create a new virtual environment with the following commands. You should change the environment path in the yaml file. (An optional check of the environment setup is sketched just below this list.)

```bash
conda create -n avdiffuss python=3.8
pip install -r requirements.txt
```
- We provide a pre-trained checkpoint for the model trained for 30 epochs on the VoxCeleb2 train set. It can be used for testing on both the VoxCeleb2 and LRS3 test datasets. The file can be downloaded here.
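As an optional check of the environment from the setup step above (this is only a suggestion, not part of the original instructions; it assumes the environment name used earlier), you can activate the environment and confirm that PyTorch imports correctly:

```bash
# optional: activate the environment created above and confirm PyTorch is importable
conda activate avdiffuss
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```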
Usage:
- For evaluating the pre-trained checkpoint, use the `--testset` option of `test.py` (see the Evaluation section below) to select the test dataset, either VoxCeleb2 or LRS3. Use the `--ckpt` option to specify the path of the checkpoint for `test.py`.
For training, run
```bash
python train.py --base_dir /path/to/voxceleb2/dir --n_gpus NUM_GPUS
```
If you don't want to save checkpoints, add the `--nolog` option.
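For example, a concrete invocation could look like the following; the dataset path and GPU counts are placeholders for illustration, not values prescribed by this repository:

```bash
# hypothetical example: train on VoxCeleb2 stored under /data/voxceleb2 using 4 GPUs
python train.py --base_dir /data/voxceleb2 --n_gpus 4

# hypothetical example: single-GPU debugging run without saving checkpoints
python train.py --base_dir /data/voxceleb2 --n_gpus 1 --nolog
```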
To evaluate on a test set, run
```bash
python test.py --testset <'vox' or 'lrs3'> --ckpt /path/to/model/checkpoint --data_dir /path/to/test/data/directory
```
Use `vox` for the VoxCeleb2 test set and `lrs3` for the LRS3 test set. Scores are computed quickly because this script only uses the first 2.04 s of each audio clip for inference.
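As an illustration, evaluating the released checkpoint on the LRS3 test set could look like this (the checkpoint filename and data directory are placeholders):

```bash
# hypothetical example: evaluate the released checkpoint on the LRS3 test set
python test.py --testset lrs3 --ckpt ./checkpoints/avdiffuss.ckpt --data_dir /data/lrs3/test
```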
If you want to evaluate on the whole audio, run

```bash
python test_whole.py --testset <'vox' or 'lrs3'> --ckpt /path/to/model/checkpoint --data_dir /path/to/test/data/directory
```
Inference can be sped up by changing the `--hop_length` option. The default value is 0.04, which is the same as in VisualVoice.
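For instance, a faster whole-audio run might use a larger hop length; the value 0.08 and the paths below are placeholders chosen only for illustration:

```bash
# hypothetical example: whole-audio evaluation on VoxCeleb2 with a larger hop length for faster inference
python test_whole.py --testset vox --ckpt ./checkpoints/avdiffuss.ckpt --data_dir /data/voxceleb2/test --hop_length 0.08
```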
The performance of the provided checkpoint, evaluated with the first test command (`test.py`), is as follows:
| Test set | PESQ | ESTOI | SI-SDR (dB) |
|---|---|---|---|
| VoxCeleb2 | 2.5906 | 0.8152 | 12.2701 |
| LRS3 | 2.8106 | 0.8856 | 14.1707 |
Since our model is based on a diffusion method, inference can be slow. That is why we use only the first 2.04 s of each audio clip when computing the scores above.
Our paper has been submitted to a conference and is currently under review, so the appropriate citation for our paper may change in the future.
```bibtex
@inproceedings{lee2024seeing,
  title={Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model},
  author={Lee, Suyeon and Jung, Chaeyoung and Jang, Youngjoon and Kim, Jaehun and Chung, Joon Son},
  booktitle={Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing},
  year={2024}
}
```