bytedance/SALMONN

 
 

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Inference

Preparation

The Python version is 3.9.20. Install the other required packages with the following command:

pip install -r requirements.txt

Create directories to store the checkpoints. If you change the structure or rename directories, update the config files and model files accordingly:

mkdir -p ckpt/MultiResQFormer
mkdir -p ckpt/pretrained_ckpt

Then download the following model checkpoints:

  1. Main video-SALMONN model checkpoint: put it under ckpt/MultiResQFormer
  2. InstructBLIP checkpoint for the Vicuna-13B model: put it under ckpt/pretrained_ckpt
  3. EVA_VIT model checkpoint for InstructBLIP: put it under ckpt/pretrained_ckpt
  4. BEATs encoder checkpoint: put it under ckpt/pretrained_ckpt
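
Before running inference, it may help to confirm that both checkpoint directories exist and contain files. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoint_dirs` is hypothetical, and it only checks that each directory holds at least one file, because the exact checkpoint filenames depend on what you downloaded.

```python
from pathlib import Path

# Directories named in the preparation steps above.
CHECKPOINT_DIRS = ("ckpt/MultiResQFormer", "ckpt/pretrained_ckpt")


def missing_checkpoint_dirs(root=".", dirs=CHECKPOINT_DIRS):
    """Return the directories that are absent or contain no files (hypothetical helper)."""
    missing = []
    for rel in dirs:
        path = Path(root) / rel
        # A directory counts as ready only if it holds at least one regular file.
        has_file = path.is_dir() and any(p.is_file() for p in path.iterdir())
        if not has_file:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_checkpoint_dirs():
        print(f"Missing or empty checkpoint directory: {rel}")
```

Run it from the repository root. An empty output means both directories contain at least one file; it does not verify that the files are the correct checkpoints.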

Run inference

python inference.py --cfg-path config/test.yaml 

Check the result

The result is saved in the following path:

./ckpt/MultiResQFormer/<DateTime>/eval_result.json
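
Because each run writes to its own `<DateTime>` directory, a small helper can locate the most recent result file. This is a sketch under the assumption that the newest file by modification time is the one you want; `latest_eval_result` is a hypothetical name, not part of the repository.

```python
from pathlib import Path


def latest_eval_result(ckpt_dir="./ckpt/MultiResQFormer"):
    """Return the most recently modified eval_result.json, or None if none exists."""
    # Each run directory sits directly under ckpt_dir, hence the single-level glob.
    candidates = list(Path(ckpt_dir).glob("*/eval_result.json"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    print(latest_eval_result())
```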

The expected result is as follows:

[
    {
        "id": "./dummy/4405327307.mp4_Describe the video and audio in detail",
        "conversation": [
            {
                "from": "human",
                "value": "Describe the video and audio in detail"
            },
            {
                "from": "gpt",
                "value": "None"
            }
        ],
        "task": "audiovisual_video_input",
        "ref_answer": "None",
        "gen_answer": "The video shows a group of musicians performing on stage, with a man singing into a microphone and playing the piano. There is also a drum set and a saxophone on stage. The audience is not visible in the video. The music is upbeat and energetic, and the performers seem to be enjoying themselves.</s>"
    }
]
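
To read the generated answers programmatically, the result file can be loaded as ordinary JSON. The sketch below assumes only the fields shown in the example above (`id`, `gen_answer`) and strips the trailing `</s>` end-of-sequence token; `load_generated_answers` is a hypothetical helper name, not part of the repository.

```python
import json


def load_generated_answers(path, eos_token="</s>"):
    """Map each entry's id to its generated answer with the trailing EOS token removed."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    answers = {}
    for entry in entries:
        text = entry["gen_answer"]
        # Remove the end-of-sequence marker only when it appears at the very end.
        if text.endswith(eos_token):
            text = text[: -len(eos_token)]
        answers[entry["id"]] = text.strip()
    return answers
```

Pass the path of the `eval_result.json` file from your own run; the returned dictionary is keyed by the `id` field of each entry.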

License & CODE_OF_CONDUCT

Please refer to the salmonn branch for more details.

✨ Citation

If you find video-SALMONN useful, please cite the paper:

@inproceedings{
  sun2024videosalmonn,
  title={video-{SALMONN}: Speech-Enhanced Audio-Visual Large Language Models},
  author={Guangzhi Sun and Wenyi Yu and Changli Tang and Xianzhao Chen and Tian Tan and Wei Li and Lu Lu and Zejun MA and Yuxuan Wang and Chao Zhang},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=nYsh5GFIqX}
}