The Python version is 3.9.20, and the other required packages can be installed with the following command:
pip install -r requirements.txt
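Optionally, you can confirm the interpreter version before installing. This is just a trivial sketch, not part of the repo:

import sys

# The setup above was tested with Python 3.9.20; warn if a different minor version is active
assert sys.version_info[:2] == (3, 9), f"Expected Python 3.9, got {sys.version}"
print(sys.version)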
Create the directories to store checkpoints (if you change the structure or rename the directories, you need to update the config files and model files accordingly):
mkdir -p ckpt/MultiResQFormer
mkdir -p ckpt/pretrained_ckpt
Then download the following model checkpoints:
- Main video-SALMONN model checkpoint, then put it under ckpt/MultiResQFormer
- InstructBLIP checkpoint for the Vicuna-13B model, then put it under ckpt/pretrained_ckpt
- EVA_VIT model checkpoint for InstructBLIP, then put it under ckpt/pretrained_ckpt
- BEATs encoder checkpoint, then put it under ckpt/pretrained_ckpt
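Before running inference, you can do a rough sanity check that the checkpoint directories are populated. This is a minimal sketch, not part of the repo; it only lists the files, it does not verify that the filenames match what the config files expect:

import os

# Both directories should contain the downloaded checkpoint file(s)
for d in ["ckpt/MultiResQFormer", "ckpt/pretrained_ckpt"]:
    assert os.path.isdir(d), f"Missing directory: {d}"
    print(d, "->", os.listdir(d))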
Then run inference with the test config:
python inference.py --cfg-path config/test.yaml
The result is saved to the following path:
./ckpt/MultiResQFormer/<DateTime>/eval_result.json
You should expect the following result:
[
    {
        "id": "./dummy/4405327307.mp4_Describe the video and audio in detail",
        "conversation": [
            {
                "from": "human",
                "value": "Describe the video and audio in detail"
            },
            {
                "from": "gpt",
                "value": "None"
            }
        ],
        "task": "audiovisual_video_input",
        "ref_answer": "None",
        "gen_answer": "The video shows a group of musicians performing on stage, with a man singing into a microphone and playing the piano. There is also a drum set and a saxophone on stage. The audience is not visible in the video. The music is upbeat and energetic, and the performers seem to be enjoying themselves.</s>"
    }
]
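If you want to inspect the generated answers programmatically, a minimal sketch (assuming the output path pattern and JSON layout shown above, where <DateTime> is a timestamped run subdirectory) is:

import glob
import json

# Pick the most recent run directory under ckpt/MultiResQFormer
result_files = sorted(glob.glob("./ckpt/MultiResQFormer/*/eval_result.json"))
with open(result_files[-1]) as f:
    results = json.load(f)

# Print the sample id and generated answer for each entry
for sample in results:
    print(sample["id"])
    print(sample["gen_answer"])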
Please refer to the salmonn branch for more details.
If you find video-SALMONN useful, please cite the paper:
@inproceedings{sun2024videosalmonn,
  title={video-{SALMONN}: Speech-Enhanced Audio-Visual Large Language Models},
  author={Guangzhi Sun and Wenyi Yu and Changli Tang and Xianzhao Chen and Tian Tan and Wei Li and Lu Lu and Zejun MA and Yuxuan Wang and Chao Zhang},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=nYsh5GFIqX}
}