Cognition-MLLM: Visual Cognition in Multimodal LLMs

This repository hosts the project pages for our research on visual cognition capabilities in Multimodal Large Language Models (MLLMs).


📄 Paper 1: What is the Visual Cognition Gap between Humans and Multimodal LLMs?

[COLM 2025]

📝 arXiv · 📄 PDF · 🤗 Benchmark · 🌐 Project Page

Authors

Xu Cao1*, Yifan Shen1*, Bolin Lai2, Wenqian Ye3, Yunsheng Ma4, Joerg Heintz1, Jintai Chen5, Meihuan Huang6, Jianguo Cao6, Aidong Zhang3, James M. Rehg1

1University of Illinois at Urbana-Champaign, 2Georgia Institute of Technology, 3University of Virginia, 4Purdue University, 5HKUST (Guangzhou), 6Shenzhen Children's Hospital

*Equal Contribution

Abstract

Recently, Multimodal Large Language Models (MLLMs) and Vision Language Models (VLMs) have shown great promise in language-guided perceptual tasks such as recognition, segmentation, and object detection. However, their effectiveness in addressing visual cognition problems that require high-level multi-image reasoning and visual working memory is not well established. One such challenge is matrix reasoning: the cognitive ability to discern relationships among patterns in a set of images and extrapolate them to predict subsequent patterns. This skill is crucial during the early neurodevelopmental stages of children.

Inspired by the matrix reasoning tasks in Raven's Progressive Matrices (RPM) and the Wechsler Intelligence Scale for Children (WISC), we propose MaRs-VQA, a new dataset for evaluating the visual cognition capability of MLLMs and comparing their performance with existing human visual cognition studies. Using the MaRs-VQA training data, we also fine-tune a baseline model, Qwen2-VCog, with multi-stage cognitive reasoning annotations.
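The multiple-choice evaluation protocol described above can be sketched as follows. This is an illustrative outline only: the item fields (`matrix_images`, `candidates`, `answer_index`) and the `model_predict` callable are hypothetical stand-ins, not the actual MaRs-VQA schema or the paper's evaluation harness.

```python
def evaluate_matrix_reasoning(items, model_predict):
    """Score a model on matrix-reasoning VQA items: each item shows a matrix
    of patterns with one cell missing plus several candidate answer images,
    and the model must pick the candidate that completes the pattern."""
    correct = 0
    for item in items:
        # model_predict receives the incomplete matrix and the candidate
        # images, and returns the index of the chosen candidate.
        choice = model_predict(item["matrix_images"], item["candidates"])
        if choice == item["answer_index"]:
            correct += 1
    return correct / len(items)

# Toy usage with a stub "model" that always picks candidate 0.
items = [
    {"matrix_images": ["m1.png"], "candidates": ["a", "b", "c", "d"], "answer_index": 0},
    {"matrix_images": ["m2.png"], "candidates": ["a", "b", "c", "d"], "answer_index": 2},
]
accuracy = evaluate_matrix_reasoning(items, lambda matrix, cands: 0)
```

Any MLLM inference call that maps (matrix, candidates) to a candidate index can be dropped in for the stub lambda.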

Citation

@misc{cao2024visualcognitiongaphumans,
      title={What is the Visual Cognition Gap between Humans and Multimodal LLMs?}, 
      author={Xu Cao and Bolin Lai and Wenqian Ye and Yunsheng Ma and Joerg Heintz and Jintai Chen and Jianguo Cao and James M. Rehg},
      year={2024},
      eprint={2406.10424},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.10424}, 
}

📄 Paper 2: Toward Cognitive Supersensing in Multimodal Large Language Models

[2026]

📝 arXiv · 📄 PDF · 🤗 Benchmark · 🤗 Model · 🌐 Project Page

Authors

Boyi Li1*, Yifan Shen1*§, Yuanzhe Liu1*, Yifan Xu1, Jiateng Liu1, Xinzhuo Li1, Zhengyuan Li1, Jingyuan Zhu2, Yunhan Zhong1, Fangzhou Lan2, Jianguo Cao2, James M. Rehg1, Heng Ji1, Ismini Lourentzou1†, Xu Cao1,2†

1University of Illinois at Urbana-Champaign, 2PediaMed AI

*Equal Contribution, §Project Lead, †Corresponding Author

Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require visual memory. Current approaches primarily scale Chain-of-Thought (CoT) reasoning in the text space, even when language alone is insufficient for clear and structured reasoning, and largely neglect visual reasoning mechanisms analogous to the human visuo-spatial sketchpad and visual imagery.

To mitigate this deficiency, we introduce Cognitive Supersensing, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities by integrating a Latent Visual Imagery Prediction (LVIP) head, which jointly learns sequences of visual cognitive latent embeddings and aligns them with the answer, thereby forming vision-based internal reasoning chains. We further introduce a reinforcement learning stage that optimizes text reasoning paths based on these grounded visual latents.
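The alignment idea behind the LVIP head can be illustrated schematically: each latent embedding in a vision-based reasoning chain is scored against an answer embedding. This is not the paper's implementation; plain Python lists stand in for latent tensors, and cosine similarity is used only as a generic stand-in for whatever alignment objective the authors use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def align_latents(latent_chain, answer_embedding):
    """Score each latent in a reasoning chain against the answer embedding,
    returning per-step similarities (higher = better aligned)."""
    return [cosine(z, answer_embedding) for z in latent_chain]

# Toy chain of two 2-D latents scored against an answer embedding.
chain = [[1.0, 0.0], [0.6, 0.8]]
scores = align_latents(chain, [0.0, 1.0])
# First latent is orthogonal to the answer (score 0.0); the second is
# partially aligned (score 0.8).
```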

To evaluate the cognitive capabilities of MLLMs, we present CogSense-Bench, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions: fluid intelligence, crystallized intelligence, visuospatial cognition, mental simulation, and visual routines.
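Since CogSense-Bench spans five cognitive dimensions, results would naturally be reported per dimension. A minimal sketch of that bookkeeping is below; the record fields (`dimension`, `correct`) are assumptions, while the dimension names come from the abstract above.

```python
from collections import defaultdict

# The five cognitive dimensions assessed by CogSense-Bench.
DIMENSIONS = [
    "fluid intelligence",
    "crystallized intelligence",
    "visuospatial cognition",
    "mental simulation",
    "visual routines",
]

def per_dimension_accuracy(results):
    """results: iterable of dicts with a 'dimension' label and a boolean
    'correct' flag; returns accuracy for each dimension seen."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["dimension"]] += 1
        hits[r["dimension"]] += int(r["correct"])
    return {d: hits[d] / totals[d] for d in totals}

# Toy result set covering two of the five dimensions.
results = [
    {"dimension": "fluid intelligence", "correct": True},
    {"dimension": "fluid intelligence", "correct": False},
    {"dimension": "visual routines", "correct": True},
]
acc = per_dimension_accuracy(results)
```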

Citation

@misc{li2026cognitivesupersensingmultimodallarge,
      title={Toward Cognitive Supersensing in Multimodal Large Language Models}, 
      author={Boyi Li and Yifan Shen and Yuanzhe Liu and Yifan Xu and Jiateng Liu and Xinzhuo Li and Zhengyuan Li and Jingyuan Zhu and Yunhan Zhong and Fangzhou Lan and Jianguo Cao and James M. Rehg and Heng Ji and Ismini Lourentzou and Xu Cao},
      year={2026},
      eprint={2602.01541},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.01541}, 
}
