Paper: Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
Chart understanding presents a critical test of the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face key limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach that represents the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines whether a chart-question pair is better solved with code or with direct visual analysis. We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The model's selection policy is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward, which grounds the model in facts and prevents numerical hallucination, with a decision reward that teaches the model when to use each strategy, preventing it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.
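As a rough illustration of the dual-reward idea described above (this is not the released implementation; the weighting, the exact-match comparison, and the function signature are all assumptions for clarity), the two signals might be combined like this:

```python
def dual_reward(pred_answer: str, true_answer: str,
                chose_code: bool, code_is_suitable: bool,
                w_acc: float = 1.0, w_dec: float = 0.5) -> float:
    """Toy sketch of a dual reward: accuracy term plus decision term.

    The real system parses tagged model outputs and verifies extracted
    chart data; here both signals are reduced to booleans for clarity.
    """
    acc = 1.0 if pred_answer.strip() == true_answer.strip() else 0.0  # data-accuracy reward
    dec = 1.0 if chose_code == code_is_suitable else 0.0              # decision reward
    return w_acc * acc + w_dec * dec
```

The decision term is what keeps the policy from collapsing to a single reasoning mode: picking CaT on a chart where code is unsuitable (or vice versa) forfeits that part of the reward even when the final answer happens to be correct.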
Create and activate a clean conda environment, then install the required dependencies:
```
conda create -n cat python=3.10 -y
conda activate cat
pip install -r requirements.txt
```

Datasets should be in Hugging Face Parquet format with the following required fields:

- `images`: list of images as bytes dictionaries, e.g. `[{"bytes": ...}]`
- `prompt`: text prompt (include the `<image>` token when an image is present)
- `ground_truth`: target answer string (some reward functions expect specific tags like `<answer>...</answer>`, `<csv>...</csv>`, `<programability>yes|no</programability>`)
We provide conversion scripts in `my_dataset/` for popular chart-understanding datasets (ChartBench/ChartQA/CharXiv). Simply edit the script constants to point to your local raw data directory and run the script to generate `benchmark_*.parquet` files.
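For custom data, a single row can be assembled as shown below. This is a minimal sketch: only the three field names come from the format above, while the `make_record` helper and the pandas write shown in the trailing comment are illustrative.

```python
def make_record(image_bytes: bytes, question: str, answer: str) -> dict:
    """Assemble one dataset row with the three required fields."""
    return {
        "images": [{"bytes": image_bytes}],            # list of bytes dictionaries
        "prompt": f"<image>{question}",                # <image> token marks the image slot
        "ground_truth": f"<answer>{answer}</answer>",  # tag expected by some reward functions
    }

record = make_record(b"\x89PNG...", "Which bar is tallest?", "2021")

# Writing a list of records to Parquet (requires pandas with a parquet
# engine such as pyarrow):
# import pandas as pd
# pd.DataFrame([record]).to_parquet("benchmark_custom.parquet")
```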
To train the model, configure and run the provided training script:
```
bash examples/qwen2_5vl_7b.sh
```

Important Configuration:
- Configure these variables in the script according to your setup: `MODEL_PATH`, `TRAIN_DATA`, `VAL_DATA`, `EXPERIMENT_NAME`, `FORMAT_PROMPT`, `REWARD_FUNCTION`, `NUM_GPUS`, and optionally `TENSORBOARD_DIR`
- The script uses `python -m verl.trainer.main` with the decision prompt and decision reward by default. Modify parameters as needed for your specific requirements.
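For reference, the variables at the top of the training script might be set as follows. All values are placeholders, and the specific filenames under `examples/` are assumptions; check the repository for the actual names.

```shell
# Placeholder values -- adjust for your environment.
MODEL_PATH=/models/Qwen2.5-VL-7B-Instruct              # local model checkpoint
TRAIN_DATA=my_dataset/benchmark_train.parquet          # converted training data
VAL_DATA=my_dataset/benchmark_val.parquet              # converted validation data
EXPERIMENT_NAME=qwen2_5vl_7b_decision_cat
FORMAT_PROMPT=examples/format_prompt/decision.jinja    # assumed filename
REWARD_FUNCTION=examples/reward_function/decision.py   # assumed filename
NUM_GPUS=8
TENSORBOARD_DIR=./tensorboard_logs                     # optional
```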
To evaluate the trained model, configure and run the validation script:
```
bash examples/val_sh/val_chartbench.sh
```

Configuration Requirements:
- Set the following variables: `MODEL_PATH`, `TRAIN_DATA`, `VAL_DATA`, `FORMAT_PROMPT`, `REWARD_FUNCTION`, `NUM_GPUS`, and `VAL_OUTPUT_FILE`
- This script runs in validation-only mode (`trainer.val_only=true`) and outputs detailed generations and evaluation metrics.
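A validation-only configuration might look like the following. Again, every value is a placeholder and the filenames under `examples/` are assumptions.

```shell
# Placeholder values for a validation-only run.
MODEL_PATH=checkpoints/qwen2_5vl_7b_decision_cat       # trained model to evaluate
VAL_DATA=my_dataset/benchmark_chartbench.parquet       # ChartBench eval data
FORMAT_PROMPT=examples/format_prompt/decision.jinja    # assumed filename
REWARD_FUNCTION=examples/reward_function/decision.py   # assumed filename
NUM_GPUS=8
VAL_OUTPUT_FILE=outputs/val_chartbench.jsonl           # generations and metrics
```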
- Model: `Qwen2_5vl_7b_decision_CaT`
- Dataset: `Decision_CaT`
- `examples/format_prompt/`: Jinja2 template prompts for code generation, chain-of-thought, and decision making
- `examples/reward_function/`: reward functions corresponding to the different prompt templates
- `examples/config.yaml`: default training configuration
- `examples/qwen2_5vl_7b.sh`: example training script for the Qwen2.5-VL-7B model
- `examples/val_sh/val_chartbench.sh`: example validation script for ChartBench evaluation
- `my_dataset/`: data conversion scripts to transform raw datasets into Parquet format
- `scripts/model_merger.py`: utility to merge FSDP model shards and export Hugging Face-compatible weights
- `verl/`: core training framework integrating Ray, FSDP, and vLLM
- `requirements.txt`: Python package dependencies
If you find this work useful for your research, please cite our paper:
```
@misc{tang2025visualprogrammabilityguidecodeasthought,
      title={Visual Programmability: A Guide for Code-as-Thought in Chart Understanding},
      author={Bohao Tang and Yan Ma and Fei Zhang and Jiadi Su and Ethan Chern and Zhulin Hu and Zhixin Wang and Pengfei Liu and Ya Zhang},
      year={2025},
      eprint={2509.09286},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.09286},
}
```

- This work is built upon the EasyR1 training framework, which provides an efficient and scalable RL training infrastructure.
- We gratefully acknowledge the open-source communities and contributors of HuggingFace Transformers, vLLM, Ray, FlashAttention, and Qwen2.5-VL for making this research possible.
This project is licensed under the Apache-2.0 License. See individual file headers for details.
