🚀🚀 Welcome to the repo of F-16!
F-16 is a powerful video large language model (LLM) designed to perceive high-frame-rate videos. It is developed by the Department of Electronic Engineering at Tsinghua University and ByteDance.
- 2025-07-03: We release the final checkpoint of F-16.
- 2025-06-18: We release the code of F-16.
- Release the code.
- Release final F-16.
- Prepare the dataset following `scripts/example_sft.json`.
- Download the LLaVA-OneVision model from Hugging Face.
- Modify the parameters in `scripts/train_sft.sh`.
- Run `bash scripts/train_sft.sh`.
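Before launching a long training job, it can help to confirm that the files named in the steps above are in place. The following is a minimal sketch, not part of the F-16 codebase: `check_sft_setup` is a hypothetical helper, and it only checks that the launch script exists and that the dataset file parses as non-empty JSON. It does not validate the fields, which are defined by `scripts/example_sft.json`.

```python
import json
from pathlib import Path


def check_sft_setup(dataset_path, script_path):
    """Return a list of problems found; an empty list means the basic checks passed.

    Hypothetical helper: it only confirms that the files exist and that the
    dataset is parseable, non-empty JSON. The expected schema is whatever
    scripts/example_sft.json shows, and is not checked here.
    """
    problems = []
    dataset = Path(dataset_path)
    script = Path(script_path)

    if not script.is_file():
        problems.append(f"missing script: {script}")

    if not dataset.is_file():
        problems.append(f"missing dataset file: {dataset}")
    else:
        try:
            with dataset.open(encoding="utf-8") as f:
                data = json.load(f)
        except json.JSONDecodeError as exc:
            problems.append(f"{dataset} is not valid JSON: {exc}")
        else:
            # Generic emptiness check only; no assumptions about the fields.
            if not data:
                problems.append(f"{dataset} parsed but is empty")

    return problems
```

For example, `check_sft_setup("data/my_sft.json", "scripts/train_sft.sh")` returns an empty list when both files are usable; `data/my_sft.json` is a placeholder for wherever your dataset lives.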
- Prepare the dataset following `scripts/example_sft.json`.
- Modify the parameters in `scripts/eval.sh`.
- Run `bash scripts/eval.sh`.
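If you want to keep the evaluation output for later inspection, one option is to launch the script from Python and save its output to a file. This is a sketch under assumptions: `run_and_log` is a hypothetical helper, not something the repository provides, and the log path is a placeholder.

```python
import subprocess
from pathlib import Path


def run_and_log(cmd, log_path):
    """Run cmd, write its combined stdout/stderr to log_path, and return the exit code."""
    log_file = Path(log_path)
    # Create the log directory if it does not exist yet.
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("w", encoding="utf-8") as log:
        # check=False: report the exit code to the caller instead of raising.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=False)
    return result.returncode
```

A call such as `run_and_log(["bash", "scripts/eval.sh"], "logs/eval.log")` would run the evaluation step and leave its output in `logs/eval.log`; a non-zero return value indicates the script exited with an error.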
Team Tsinghua: Yixuan Li, Changli Tang, Jimin Zhuang, Yudong Yang, Guangzhi Sun, Chao Zhang
Team ByteDance: Wei Li, Zejun Ma
If you find F-16 useful, please cite the paper:
@inproceedings{li2025improving,
title={Improving LLM Video Understanding with 16 Frames Per Second},
author={Li, Yixuan and Tang, Changli and Zhuang, Jimin and Yang, Yudong and Sun, Guangzhi and Li, Wei and Ma, Zejun and Zhang, Chao},
booktitle={Proc. ICML},
year={2025},
address={Vancouver}
}