This is the official repository for the paper "See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning".
Code and models will be released soon.
To mitigate the perceptual bottleneck in VLMs, recent approaches often rely on external tools or explicit intermediate visual cues (e.g., generated masks, bounding boxes, or latent tokens) during inference. However, these paradigms face three critical limitations:
- **Shape Rigidity:** Coarse boxes or masks fail to capture irregular, fine-grained evidence (e.g., thin polylines or specific intersections in charts).
- **Limited Generalization:** Task-specific tools generalize poorly across diverse domains.
- **Inference Overhead:** Multi-step visual reasoning increases computation costs and latency.

BiPS takes a different route. Instead of using visual cues as inference-time crutches, we transform them into training signals to internalize perception.
BiPS shapes the model's internal policy through a two-stage curriculum built on views generated programmatically by editing chart code (a toy sketch of this view generation follows the list below):
- **Consistency Stage:** Minimizes divergence between the original image and an Evidence-Preserving View, teaching the model to focus on complete, supporting visual details.
- **Separation Stage:** Maximizes divergence from an Evidence-Ablated View, penalizing the model for relying on text-only shortcuts when visual evidence is missing.
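
Since the code and data pipeline are not released yet, the snippet below is only a minimal sketch of what such chart-code editing could look like: a toy matplotlib chart is re-rendered with the question-relevant series kept (an evidence-preserving view) or deleted from the chart code (an evidence-ablated view). The `render_chart` helper, the series names, and the "keep only the relevant series" rule are illustrative assumptions, not the paper's actual pipeline.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt


def render_chart(series, keep_only=None, out_path="view.png"):
    """Hypothetical helper: render a line chart from `series` (label -> y values).

    If `keep_only` is set, all other series are dropped (approximating an
    evidence-preserving view); removing the question-relevant series from
    `series` before calling approximates an evidence-ablated view.
    """
    fig, ax = plt.subplots(figsize=(4, 3))
    for label, ys in series.items():
        if keep_only is not None and label != keep_only:
            continue
        ax.plot(range(len(ys)), ys, marker="o", label=label)
    ax.legend(loc="best")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


data = {"revenue": [1, 3, 2, 5], "cost": [2, 2, 3, 3]}

render_chart(data, out_path="original.png")                         # original chart
render_chart(data, keep_only="revenue", out_path="preserving.png")  # evidence kept
render_chart({k: v for k, v in data.items() if k != "revenue"},
             out_path="ablated.png")                                # evidence removed
```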
By strictly enforcing these constraints during training, BiPS achieves fine-grained visual grounding without any additional inference cost. Across 8 benchmarks, it boosts Qwen2.5-VL-7B by an average of 8.2%, demonstrating strong cross-domain generalization.
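
For intuition only, here is a rough sketch of how a per-stage shaping signal could be written, assuming the divergence is a KL between the policy's per-token output distributions on paired views; the function names, the KL choice, and the clamp on the separation term are assumptions for illustration, not the paper's objective.

```python
import torch
import torch.nn.functional as F


def view_divergence(logits_ref, logits_view):
    """KL(p_ref || p_view) over per-token logits of shape (T, V)."""
    log_p_ref = F.log_softmax(logits_ref, dim=-1)
    log_p_view = F.log_softmax(logits_view, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    return F.kl_div(log_p_view, log_p_ref, log_target=True, reduction="batchmean")


def shaping_loss(stage, logits_orig, logits_preserving=None, logits_ablated=None,
                 max_sep=5.0):
    """Hypothetical per-stage training term, minimized by gradient descent.

    "consistency": pull the original-image distribution toward the
    evidence-preserving view. "separation": push it away from the
    evidence-ablated view (negated and clamped so the repulsive term
    cannot grow without bound).
    """
    if stage == "consistency":
        return view_divergence(logits_orig, logits_preserving)
    if stage == "separation":
        return -torch.clamp(view_divergence(logits_orig, logits_ablated), max=max_sep)
    raise ValueError(f"unknown stage: {stage}")
```

In practice such a term would be combined with the usual answer supervision (e.g., a cross-entropy or reward objective) in each stage rather than used on its own.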
If you find this work helpful in your research, please cite our paper:
```bibtex
@article{zhang2025bips,
  title={See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning},
  author={Zhang, Shuoshuo and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Yang, Yujiu and Wang, Rui},
  journal={arXiv preprint arXiv:2512.22120},
  year={2025}
}
```
