by Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne.
📚 arXiv preprint | 🖥️ Project webpage
- 📰 Our work was featured on MIT News!
- ✨ The paper has been accepted at CVPR 2025!
- 🚀 Code and pretrained models for retrieval on VGGSound are here!
- 🚧 Classification/localization code and models coming soon!
Before running the code, you need to install the required Python libraries. You can do this using either a virtual environment with pip or conda.
```bash
# Create a virtual environment
python3 -m venv venv

# Activate the virtual environment
source venv/bin/activate

# Install the required packages
pip install -r requirements.txt
```

```bash
# Create a conda environment
conda create -n cav-mae-sync python=3.7

# Activate the conda environment
conda activate cav-mae-sync

# Install the required packages
pip install -r requirements.txt
```

For preparing your data (audio, frames, and label files), follow the original CAV-MAE instructions here:
https://github.com/YuanGongND/cav-mae#data-preparation
This ensures your data is in the correct format for CAV-MAE Sync. No changes are needed; just use the same process as the original repo.
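As a rough illustration, a datafile in the original CAV-MAE format is a JSON file with a top-level `data` list of per-clip entries. The minimal Python sketch below builds one such file; the field names (`video_id`, `wav`, `video_path`, `labels`) and the label format are assumptions based on the linked instructions, so double-check them against the original repo.

```python
import json

# Illustrative only: the field names and label format below are assumptions
# based on the original CAV-MAE data-preparation instructions; verify them
# against the linked repo before training or evaluation.
entries = [
    {
        "video_id": "abc123_000030",                          # clip identifier
        "wav": "/data/vggsound/audio/abc123_000030.wav",      # extracted audio
        "video_path": "/data/vggsound/frames/abc123_000030",  # extracted frames
        "labels": "playing violin",                           # class label(s)
    },
]

with open("vggsound_example.json", "w") as f:
    json.dump({"data": entries}, f, indent=2)
```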
To perform retrieval tasks, you first need to download the pretrained models and generate the necessary data files.
Navigate to the pretrained_models directory and execute the script:
```bash
cd pretrained_models
sh get_pretrained_model.sh
```

This will download the `cav_mae_sync.pth` file.
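If you want to confirm the download succeeded, a quick sanity check is to load the checkpoint with PyTorch and inspect its contents. Whether the file is a plain state_dict or a wrapping dict is an assumption here, so the snippet only inspects the file rather than instantiating the model.

```python
import torch

# Quick, optional sanity check that the downloaded checkpoint is readable.
# The exact layout (plain state_dict vs. a dict like {"model": ...}) is an
# assumption, so we only inspect the file instead of building the model.
ckpt = torch.load("pretrained_models/cav_mae_sync.pth", map_location="cpu")
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
print(f"checkpoint loaded with {len(state_dict)} entries")
```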
Navigate to the datafiles directory and run the script:
```bash
cd ../datafiles  # assuming you are in pretrained_models; otherwise adjust the path
sh generate_datafiles.sh
```

This script will prompt you for the path to your VGGSound dataset and then use it to generate the final data JSON files from the templates.
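After the script finishes, it can be worth sanity-checking the generated JSON files. The sketch below assumes the generated files live in datafiles/ and follow the original CAV-MAE format (a top-level `data` list with a `wav` path per entry); adjust the glob pattern and field names if your output differs.

```python
import glob
import json
import os

# Hypothetical check: the glob pattern and field names ("data", "wav") are
# assumptions based on the original CAV-MAE datafile format; adjust as needed.
for path in sorted(glob.glob("datafiles/*.json")):
    with open(path) as f:
        data = json.load(f)["data"]
    missing = [e["wav"] for e in data if not os.path.exists(e["wav"])]
    print(f"{path}: {len(data)} samples, {len(missing)} missing wav files")
```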
To run retrieval (example on VGGSound subset):
```bash
python src/retrieval.py --nums_samples 1600 --directions audio video --strategy diagonal_mean
```

Where:
- `--nums_samples` is the number of samples to evaluate.
- `--directions` is the direction of the retrieval task: `{audio, video}`.
- `--strategy` is the aggregation strategy used for retrieval: `{diagonal_mean, diagonal_max, mean, max}` (an illustrative sketch of these strategies follows the notes below).
Dataset Size Note: The VGGSound retrieval subset used here contains about 1,520 samples. If you set `--nums_samples` higher than that, the script simply uses the entire dataset (i.e., all available samples are evaluated).
Note: Computing the similarity matrix in this (non-parallelized) code can take up to 40 minutes for the full retrieval result. For quick tests, reduce `--nums_samples` to a smaller value (e.g., 100 or 500). Alternatively, this large matrix multiplication could be optimized (PRs gladly accepted!).
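To make the strategy names concrete, one way to read them: given per-segment audio and visual embeddings for a clip, a clip-level similarity can be formed by aggregating the segment-by-segment similarity matrix. The `diagonal_*` variants score only temporally aligned segment pairs, while `mean`/`max` pool over all pairs. The NumPy sketch below is an illustrative interpretation, not the repository's actual retrieval code.

```python
import numpy as np

def clip_similarity(audio_seg, video_seg, strategy="diagonal_mean"):
    """Aggregate per-segment audio-visual similarities into one clip-level score.

    audio_seg, video_seg: (num_segments, dim) L2-normalized embeddings for a
    single clip. Illustrative reading of the --strategy options only.
    """
    sim = audio_seg @ video_seg.T            # (num_segments, num_segments) cosine similarities
    if strategy == "diagonal_mean":          # average over temporally aligned segment pairs
        return float(np.diag(sim).mean())
    if strategy == "diagonal_max":           # best temporally aligned segment pair
        return float(np.diag(sim).max())
    if strategy == "mean":                   # average over all segment pairs
        return float(sim.mean())
    if strategy == "max":                    # best segment pair overall
        return float(sim.max())
    raise ValueError(f"unknown strategy: {strategy}")

# Toy usage with random unit-norm segment embeddings (4 segments, 64-dim).
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 64)); a /= np.linalg.norm(a, axis=1, keepdims=True)
v = rng.normal(size=(4, 64)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(clip_similarity(a, v, "diagonal_mean"), clip_similarity(a, v, "mean"))
```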
If you use CAV-MAE Sync, please cite:
@inproceedings{araujo2025cavmaesync,
title = {CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment},
author = {Araujo, Edson and Rouditchenko, Andrew and Gong, Yuan and Bhati, Saurabhchand and Thomas, Samuel and Kingsbury, Brian and Karlinsky, Leonid and Feris, Rogerio and Glass, James R. and Kuehne, Hilde},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2025}
}