This repository hosts the project pages for our research on visual cognition capabilities in Multimodal Large Language Models (MLLMs).
What is the Visual Cognition Gap between Humans and Multimodal LLMs? [COLM 2025]
📝 arXiv • 📄 PDF • 🤗 Benchmark • 🌐 Project Page
Xu Cao1*, Yifan Shen1*, Bolin Lai2, Wenqian Ye3, Yunsheng Ma4, Joerg Heintz1, Jintai Chen5, Meihuan Huang6, Jianguo Cao6, Aidong Zhang3, James M. Rehg1
1University of Illinois at Urbana-Champaign, 2Georgia Institute of Technology, 3University of Virginia, 4Purdue University, 5HKUST (Guangzhou), 6Shenzhen Children's Hospital
*Equal Contribution
Recently, Multimodal Large Language Models (MLLMs) and Vision Language Models (VLMs) have shown great promise in language-guided perceptual tasks such as recognition, segmentation, and object detection. However, their effectiveness in addressing visual cognition problems that require high-level multi-image reasoning and visual working memory is not well established. One such challenge is matrix reasoning: the cognitive ability to discern relationships among patterns in a set of images and extrapolate to predict subsequent patterns. This skill is crucial during children's early neurodevelopmental stages.
Inspired by the matrix reasoning tasks in Raven's Progressive Matrices (RPM) and the Wechsler Intelligence Scale for Children (WISC), we propose a new dataset, MaRs-VQA, to evaluate the visual cognition capability of MLLMs and compare their performance with existing human visual cognition studies. Using the training split of MaRs-VQA, we also fine-tune a baseline model, Qwen2-VCog, with multi-stage cognitive reasoning annotations.
@misc{cao2024visualcognitiongaphumans,
title={What is the Visual Cognition Gap between Humans and Multimodal LLMs?},
author={Xu Cao and Bolin Lai and Wenqian Ye and Yunsheng Ma and Joerg Heintz and Jintai Chen and Jianguo Cao and James M. Rehg},
year={2024},
eprint={2406.10424},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.10424},
}

Toward Cognitive Supersensing in Multimodal Large Language Model [2026]
📝 arXiv • 📄 PDF • 🤗 Benchmark • 🤗 Model • 🌐 Project Page
Boyi Li1*, Yifan Shen1*§, Yuanzhe Liu1*, Yifan Xu1, Jiateng Liu1, Xinzhuo Li1, Zhengyuan Li1, Jingyuan Zhu2, Yunhan Zhong1, Fangzhou Lan2, Jianguo Cao2, James M. Rehg1, Heng Ji1, Ismini Lourentzou1†, Xu Cao1,2†
1University of Illinois at Urbana-Champaign, 2PediaMed AI
*Equal Contribution, §Project Lead, †Corresponding Author
Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuo-spatial sketchpad and visual imagery.
To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities. It integrates a Latent Visual Imagery Prediction (LVIP) head that jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths grounded in these visual latents.
To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions: fluid intelligence, crystallized intelligence, visuospatial cognition, mental simulation, and visual routines.
@misc{li2026cognitivesupersensingmultimodallarge,
title={Toward Cognitive Supersensing in Multimodal Large Language Model},
author={Boyi Li and Yifan Shen and Yuanzhe Liu and Yifan Xu and Jiateng Liu and Xinzhuo Li and Zhengyuan Li and Jingyuan Zhu and Yunhan Zhong and Fangzhou Lan and Jianguo Cao and James M. Rehg and Heng Ji and Ismini Lourentzou and Xu Cao},
year={2026},
eprint={2602.01541},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.01541},
}

PediaMed AI • University of Illinois at Urbana-Champaign