
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models

To Do List

[2026/02/04] Replaced the Math-Vision results with MathVista [Done].

🌟🌟🌟 Method

This repo is the official implementation of: VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models.

Drawing inspiration from human cognitive memory theory, we propose a cognitively aligned framework that equips VLMs with dynamic latent vision memories: a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation.
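As a rough illustration only (not the repository's actual implementation), the short-term/long-term split described above can be sketched as follows; every class and method name here is hypothetical:

```python
class ShortTermVisionMemory:
    """Hypothetical sketch: retains the most recent fine-grained visual features."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.slots = []

    def write(self, features):
        self.slots.append(features)
        self.slots = self.slots[-self.capacity:]  # evict the oldest beyond capacity

    def read(self):
        # Fine-grained retention: recent features are returned unchanged.
        return [f for slot in self.slots for f in slot]


class LongTermVisionMemory:
    """Hypothetical sketch: consolidates features into one abstract summary."""

    def __init__(self, dim, momentum=0.9):
        self.state = [0.0] * dim
        self.momentum = momentum

    def consolidate(self, features):
        # Semantic consolidation via an exponential moving average of the mean feature.
        mean = [sum(f[i] for f in features) / len(features)
                for i in range(len(self.state))]
        self.state = [self.momentum * s + (1 - self.momentum) * m
                      for s, m in zip(self.state, mean)]

    def read(self):
        return self.state
```

In this toy version, the short-term module keeps raw recent features (perceptual fidelity) while the long-term module keeps only a slowly updated summary (semantic consistency); the actual paper operates on latent vision tokens inside the VLM.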

[Figure: overview of the VisMem framework]

🫡🫡🫡 Citation

@article{yu2025vismem,
  title={Vismem: Latent vision memory unlocks potential of vision-language models},
  author={Yu, Xinlei and Xu, Chengming and Zhang, Guibin and Chen, Zhangquan and Zhang, Yudong and He, Yongbo and Jiang, Peng-Tao and Zhang, Jiangning and Hu, Xiaobin and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2511.11007},
  year={2025}
}

👍👍👍 Quick Start

(1) Installation

conda create -n main python=3.10 -y
conda activate main
pip install -r requirements.txt

(2) Training

Recommended hardware: at least 8 NVIDIA H200 (141 GB) GPUs.

Stage I

python -m main.cli.train_stage1 \
  --model_name_or_path Qwen/Qwen2.5-VL-7B-Instruct \
  --train_jsonl /path/to/train.jsonl \
  --output_dir outputs/stage1 \
  --epochs 1

Stage II

python -m main.cli.train_stage2 \
  --model_name_or_path Qwen/Qwen2.5-VL-7B-Instruct \
  --train_jsonl /path/to/train.jsonl \
  --init_from outputs/stage1 \
  --output_dir outputs/stage2 \
  --epochs 1

(3) Evaluation

All datasets should be converted to JSONL files with the fields expected by "/data/jsonl_dataset.py". Then run inference:

python -m main.cli.infer \
  --model path_to_model \
  --samples path_to_samples \
  --max_new_tokens 256 \
  --enable_vismem
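For illustration, a JSONL evaluation file holds one JSON object per line. The field names below (`image`, `question`, `answer`) are hypothetical placeholders; the authoritative schema is defined in `/data/jsonl_dataset.py`:

```python
import json
import os
import tempfile

# Hypothetical record layout for illustration only; check /data/jsonl_dataset.py
# in this repo for the actual field names.
samples = [
    {
        "image": "images/0001.png",
        "question": "What is the total shown on the receipt?",
        "answer": "42.50",
    },
]

path = os.path.join(tempfile.mkdtemp(), "eval.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")  # one JSON object per line

# Reading it back line by line recovers the original records.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

The resulting file is what you would pass to `--samples` in the inference command above.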

🔥🔥🔥 Results

Main Comparisons

[Figure: main comparison results]

Results on Various Base Models

[Figure: results on various base models]

Cross-domain Generalization

[Figure: cross-domain generalization results]

Catastrophic Forgetting Mitigation

[Figure: catastrophic forgetting mitigation results]

Dynamic Memory Invocation

[Figure: dynamic memory invocation analysis]

Efficiency Analysis

[Figure: efficiency analysis]
