See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Paper | Checkpoint

Introduction

This is the official repository for the paper "See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning".

Code and models will be released soon.

Motivation

To mitigate the perceptual bottleneck in VLMs, recent approaches often rely on external tools or explicit intermediate visual cues (e.g., generated masks, bounding boxes, or latent tokens) during inference. However, these paradigms face three critical limitations:

  • Shape Rigidity: Coarse boxes or masks fail to capture irregular, fine-grained evidence (e.g., thin polylines or specific intersections in charts).

  • Limited Generalization: Task-specific tools generalize poorly across diverse domains.

  • Inference Overhead: Multi-step visual reasoning increases computation costs and latency.

BiPS takes a different route. Instead of using visual cues as inference-time crutches, we transform them into training signals to internalize perception.

Method: Bi-directional Perceptual Shaping

BiPS shapes the model's internal policy through a two-stage curriculum using programmatically generated views via chart code editing:

  • Consistency Stage: Minimizes divergence between the original image and an Evidence-Preserving View, teaching the model to focus on complete, supporting visual details.

  • Separation Stage: Maximizes divergence from an Evidence-Ablated View, penalizing the model for relying on text-only shortcuts when visual evidence is missing.

By strictly enforcing these constraints during training, BiPS achieves fine-grained visual grounding without any additional inference cost. Across 8 benchmarks, it boosts Qwen2.5-VL-7B by an average of 8.2%, demonstrating strong cross-domain generalization.
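The two stages can be read as a pair of divergence constraints on the model's answer distribution under different views of the same image. Since the code has not been released yet, the snippet below is only a minimal PyTorch sketch of how such a bi-directional objective could look, assuming KL divergence as the consistency/separation measure and a hinge-style margin for the separation term; the function name, tensor shapes, and separation_margin are illustrative assumptions, not the official implementation.

import torch
import torch.nn.functional as F


def bips_shaping_losses(
    logits_original: torch.Tensor,    # [batch, seq, vocab] answer logits given the original image
    logits_preserving: torch.Tensor,  # [batch, seq, vocab] given the Evidence-Preserving View
    logits_ablated: torch.Tensor,     # [batch, seq, vocab] given the Evidence-Ablated View
    separation_margin: float = 1.0,   # hypothetical margin for the separation term
):
    """Sketch of the two BiPS shaping terms (assumed formulation).

    Consistency: pull the answer distribution on the Evidence-Preserving View
    toward the distribution on the original image (minimize KL divergence).
    Separation: push the distribution on the Evidence-Ablated View away from
    the original one (maximize divergence, here via a hinge on the KL).
    """
    log_p_orig = F.log_softmax(logits_original, dim=-1)
    log_p_pres = F.log_softmax(logits_preserving, dim=-1)
    log_p_abl = F.log_softmax(logits_ablated, dim=-1)

    def per_token_kl(target_log: torch.Tensor, input_log: torch.Tensor) -> torch.Tensor:
        # KL(target || input), averaged over all tokens in the batch.
        return F.kl_div(
            input_log.reshape(-1, input_log.size(-1)),
            target_log.reshape(-1, target_log.size(-1)),
            reduction="batchmean",
            log_target=True,
        )

    # Consistency stage: minimize divergence between original and preserving views.
    consistency = per_token_kl(log_p_orig, log_p_pres)

    # Separation stage: encourage the divergence from the ablated view to exceed
    # a margin, penalizing text-only shortcuts that survive evidence removal.
    separation = F.relu(separation_margin - per_token_kl(log_p_orig, log_p_abl))

    return consistency, separation


if __name__ == "__main__":
    b, t, v = 2, 4, 32
    lo, lp, la = (torch.randn(b, t, v) for _ in range(3))
    c, s = bips_shaping_losses(lo, lp, la)
    print(f"consistency={c.item():.3f}  separation={s.item():.3f}")

In an actual training setup these terms would be added to the usual task objective, with the consistency term emphasized in the first stage and the separation term in the second, per the curriculum described above.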

πŸ“ Citation

If you find this work helpful in your research, please cite our paper:

@article{zhang2025bips,
  title={See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning},
  author={Zhang, Shuoshuo and Zhang, Yizhen and Fu, Jingjing and Song, Lei and Bian, Jiang and Yang, Yujiu and Wang, Rui},
  journal={arXiv preprint arXiv:2512.22120},
  year={2025}
}
