Cognition-MLLM: Visual Cognition in Multimodal LLMs

This repository hosts the project pages for our research on visual cognition capabilities in Multimodal Large Language Models (MLLMs).


📄 Paper 1: What is the Visual Cognition Gap between Humans and Multimodal LLMs?

[COLM 2025]

📝 arXiv · 📄 PDF · 🤗 Benchmark · 🌐 Project Page

Authors

Xu Cao1*, Yifan Shen1*, Bolin Lai2, Wenqian Ye3, Yunsheng Ma4, Joerg Heintz1, Jintai Chen5, Meihuan Huang6, Jianguo Cao6, Aidong Zhang3, James M. Rehg1

1University of Illinois at Urbana-Champaign, 2Georgia Institute of Technology, 3University of Virginia, 4Purdue University, 5HKUST (Guangzhou), 6Shenzhen Children's Hospital

*Equal Contribution

Abstract

Recently, Multimodal Large Language Models (MLLMs) and Vision Language Models (VLMs) have shown great promise in language-guided perceptual tasks such as recognition, segmentation, and object detection. However, their effectiveness in addressing visual cognition problems that require high-level multi-image reasoning and visual working memory is not well established. One such challenge is matrix reasoning: the cognitive ability to discern relationships among patterns in a set of images and extrapolate them to predict subsequent patterns. This skill is crucial during the early neurodevelopmental stages of children.

Inspired by the matrix reasoning tasks in Raven's Progressive Matrices (RPM) and the Wechsler Intelligence Scale for Children (WISC), we propose MaRs-VQA, a new dataset for evaluating the visual cognition capability of MLLMs and comparing their performance with existing human visual cognition studies. Using the MaRs-VQA training data, we also fine-tune a baseline model, Qwen2-VCog, with multi-stage cognitive reasoning annotations.
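The multiple-choice evaluation protocol described above can be sketched as follows. This is an illustrative outline only: the item fields (`matrix_images`, `candidates`, `answer_index`) and the `model_predict` callable are hypothetical stand-ins, not the actual MaRs-VQA schema or the paper's evaluation harness.

```python
def evaluate_matrix_reasoning(items, model_predict):
    """Score a model on matrix-reasoning VQA items: each item shows a matrix
    of patterns with one cell missing plus several candidate answer images,
    and the model must pick the candidate that completes the pattern."""
    correct = 0
    for item in items:
        # model_predict receives the incomplete matrix and the candidate
        # images, and returns the index of the chosen candidate.
        choice = model_predict(item["matrix_images"], item["candidates"])
        if choice == item["answer_index"]:
            correct += 1
    return correct / len(items)

# Toy usage with a stub "model" that always picks candidate 0.
items = [
    {"matrix_images": ["m1.png"], "candidates": ["a", "b", "c", "d"], "answer_index": 0},
    {"matrix_images": ["m2.png"], "candidates": ["a", "b", "c", "d"], "answer_index": 2},
]
accuracy = evaluate_matrix_reasoning(items, lambda matrix, cands: 0)
```

Any MLLM inference call that maps (matrix, candidates) to a candidate index can be dropped in for the stub lambda.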

Citation

@misc{cao2024visualcognitiongaphumans,
      title={What is the Visual Cognition Gap between Humans and Multimodal LLMs?}, 
      author={Xu Cao and Bolin Lai and Wenqian Ye and Yunsheng Ma and Joerg Heintz and Jintai Chen and Jianguo Cao and James M. Rehg},
      year={2024},
      eprint={2406.10424},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.10424}, 
}

📄 Paper 2: Toward Cognitive Supersensing in Multimodal Large Language Models

[2026]

📝 arXiv · 📄 PDF · 🤗 Benchmark · 🤗 Model · 🌐 Project Page

Authors

Boyi Li1*, Yifan Shen1*§, Yuanzhe Liu1*, Yifan Xu1, Jiateng Liu1, Xinzhuo Li1, Zhengyuan Li1, Jingyuan Zhu2, Yunhan Zhong1, Fangzhou Lan2, Jianguo Cao2, James M. Rehg1, Heng Ji1, Ismini Lourentzou1†, Xu Cao1,2†

1University of Illinois at Urbana-Champaign, 2PediaMed AI

*Equal Contribution, §Project Lead, †Corresponding Author

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuo-spatial sketchpad and visual imagery.

To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head, which jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on these grounded visual latents.
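The alignment idea behind the LVIP head can be illustrated schematically: each latent embedding in a vision-based reasoning chain is scored against an answer embedding. This is not the paper's implementation; plain Python lists stand in for latent tensors, and cosine similarity is used only as a generic stand-in for whatever alignment objective the authors use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def align_latents(latent_chain, answer_embedding):
    """Score each latent in a reasoning chain against the answer embedding,
    returning per-step similarities (higher = better aligned)."""
    return [cosine(z, answer_embedding) for z in latent_chain]

# Toy chain of two 2-D latents scored against an answer embedding.
chain = [[1.0, 0.0], [0.6, 0.8]]
scores = align_latents(chain, [0.0, 1.0])
# First latent is orthogonal to the answer (score 0.0); the second is
# partially aligned (score 0.8).
```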

To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions: fluid intelligence, crystallized intelligence, visuospatial cognition, mental simulation, and visual routines.
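Since CogSense-Bench spans five cognitive dimensions, results would naturally be reported per dimension. A minimal sketch of that bookkeeping is below; the record fields (`dimension`, `correct`) are assumptions, while the dimension names come from the abstract above.

```python
from collections import defaultdict

# The five cognitive dimensions assessed by CogSense-Bench.
DIMENSIONS = [
    "fluid intelligence",
    "crystallized intelligence",
    "visuospatial cognition",
    "mental simulation",
    "visual routines",
]

def per_dimension_accuracy(results):
    """results: iterable of dicts with a 'dimension' label and a boolean
    'correct' flag; returns accuracy for each dimension seen."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["dimension"]] += 1
        hits[r["dimension"]] += int(r["correct"])
    return {d: hits[d] / totals[d] for d in totals}

# Toy result set covering two of the five dimensions.
results = [
    {"dimension": "fluid intelligence", "correct": True},
    {"dimension": "fluid intelligence", "correct": False},
    {"dimension": "visual routines", "correct": True},
]
acc = per_dimension_accuracy(results)
```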

Citation

@misc{li2026cognitivesupersensingmultimodallarge,
      title={Toward Cognitive Supersensing in Multimodal Large Language Models}, 
      author={Boyi Li and Yifan Shen and Yuanzhe Liu and Yifan Xu and Jiateng Liu and Xinzhuo Li and Zhengyuan Li and Jingyuan Zhu and Yunhan Zhong and Fangzhou Lan and Jianguo Cao and James M. Rehg and Heng Ji and Ismini Lourentzou and Xu Cao},
      year={2026},
      eprint={2602.01541},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.01541}, 
}
