The Python version is 3.9.20, and the other required packages can be installed with the following command:
pip install -r requirements.txt
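Optionally, you can confirm the interpreter version before installing. This is just a trivial sketch, not part of the repo:

import sys

# The setup above was tested with Python 3.9.20; warn if a different minor version is active
assert sys.version_info[:2] == (3, 9), f"Expected Python 3.9, got {sys.version}"
print(sys.version)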
Create the directories to store checkpoints (if you change the structure or rename the directories, you need to update the config files and model files accordingly):
mkdir -p ckpt/MultiResQFormer
mkdir -p ckpt/pretrained_ckpt
Then download the following model checkpoints:
- Main video-SALMONN model checkpoint, then put it under ckpt/MultiResQFormer
- InstructBLIP checkpoint for the Vicuna-13B model, then put it under ckpt/pretrained_ckpt
- EVA_VIT model checkpoint for InstructBLIP, then put it under ckpt/pretrained_ckpt
- BEATs encoder checkpoint, then put it under ckpt/pretrained_ckpt
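Before running inference, you can do a rough sanity check that the checkpoint directories are populated. This is a minimal sketch, not part of the repo; it only lists the files, it does not verify that the filenames match what the config files expect:

import os

# Both directories should contain the downloaded checkpoint file(s)
for d in ["ckpt/MultiResQFormer", "ckpt/pretrained_ckpt"]:
    assert os.path.isdir(d), f"Missing directory: {d}"
    print(d, "->", os.listdir(d))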
Then run inference with the test config:
python inference.py --cfg-path config/test.yaml
The result is saved to the following path:
./ckpt/MultiResQFormer/<DateTime>/eval_result.json
You should expect the following result:
[
    {
        "id": "./dummy/4405327307.mp4_Describe the video and audio in detail",
        "conversation": [
            {
                "from": "human",
                "value": "Describe the video and audio in detail"
            },
            {
                "from": "gpt",
                "value": "None"
            }
        ],
        "task": "audiovisual_video_input",
        "ref_answer": "None",
        "gen_answer": "The video shows a group of musicians performing on stage, with a man singing into a microphone and playing the piano. There is also a drum set and a saxophone on stage. The audience is not visible in the video. The music is upbeat and energetic, and the performers seem to be enjoying themselves.</s>"
    }
]
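If you want to inspect the generated answers programmatically, a minimal sketch (assuming the output path pattern and JSON layout shown above, where <DateTime> is a timestamped run subdirectory) is:

import glob
import json

# Pick the most recent run directory under ckpt/MultiResQFormer
result_files = sorted(glob.glob("./ckpt/MultiResQFormer/*/eval_result.json"))
with open(result_files[-1]) as f:
    results = json.load(f)

# Print the sample id and generated answer for each entry
for sample in results:
    print(sample["id"])
    print(sample["gen_answer"])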
Please refer to the salmonn branch for more details.
If you find video-SALMONN useful, please cite the paper:
@inproceedings{sun2024videosalmonn,
  title={video-{SALMONN}: Speech-Enhanced Audio-Visual Large Language Models},
  author={Guangzhi Sun and Wenyi Yu and Changli Tang and Xianzhao Chen and Tian Tan and Wei Li and Lu Lu and Zejun MA and Yuxuan Wang and Chao Zhang},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=nYsh5GFIqX}
}