MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning
Shengyuan Liu1 Liuxin Bao1 Qi Yang2,3 Wanting Geng2,4 Boyun Zheng1 Chenxin Li1 Wenting Chen5 Houwen Peng2✉ Yixuan Yuan1✉
1Chinese University of Hong Kong 2Hunyuan Group, Tencent 3Institute of Automation, the Chinese Academy of Sciences 4Dalian University of Technology 5Stanford University
✉ Corresponding Author.
In this work, we propose MedSAM-Agent, a framework that reformulates interactive segmentation as a multi-step autonomous decision-making process. First, we introduce a hybrid prompting strategy for expert-curated trajectory generation, enabling the model to internalize human-like decision heuristics and adaptive refinement strategies. Furthermore, we develop a two-stage training pipeline that integrates multi-turn, end-to-end outcome verification with a clinical-fidelity process reward design to promote interaction parsimony and decision efficiency.

- Release the SFT and RL dataset for MedSAM-Agent.
- Release the code of trajectory generation.
- Release the paper, model and the base code for MedSAM-Agent.
- We use Python 3.11 / CUDA 12.9 / torch 2.8.0 for implementation.
- We train our models on 8 NVIDIA H20 GPUs with 96 GB memory each.
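For reference, the interpreter / torch / CUDA versions listed above can be checked with a short snippet (nothing here is specific to this repo; torch is only probed if it is installed):

```python
import sys
import importlib

# Report the interpreter version and, if available, the torch/CUDA versions.
print(f"python: {sys.version_info.major}.{sys.version_info.minor}")
try:
    torch = importlib.import_module("torch")
    print(f"torch:  {torch.__version__}")
    print(f"cuda:   {torch.version.cuda} (available: {torch.cuda.is_available()})")
except ImportError:
    print("torch:  not installed")
```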
```bash
# create environment
conda create -n msagent python=3.11
conda activate msagent
pip install -r requirements.txt
```

We support three segmentation backbones: MedSAM2, SAM, and IMISNet. Please download the checkpoints from:
For SAM2.1 and IMISNet, please also download the dependency repositories and install them:
```bash
cd third_party/
git clone https://github.com/facebookresearch/sam2.git
cd sam2
pip install -e .
```

In this repo, our dataset is based on BioMedParse and UniBioMed. We evaluate our model on 6 modalities and 21 datasets. Details of the dataset split can be found in our paper.
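To confirm the editable install above took effect, a quick importability probe (the `sam2` module name comes from the cloned repo; IMISNet's import name may differ, so it is left out here):

```python
import importlib.util

# Packages installed from third_party/ above; "sam2" is the module name of
# the facebookresearch/sam2 repo. Add IMISNet's import name once known.
packages = ["sam2"]

for pkg in packages:
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'ok' if found else 'missing (re-run pip install -e . in third_party/sam2)'}")
```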
We will release the SFT trajectory dataset and RL training dataset soon.
- Single sample (one image): Run `infer/run_single_inference.py` with your paths:
```bash
cd infer
python run_single_inference.py \
  --img-path infer/demo/BTCV-0-106_CT_abdomen.png \
  --target-description "right kidney in abdomen CT" \
  --model-path /path/to/mllm_model \
  --seg-checkpoint /path/to/MedSAM2_latest.pt \
  --seg-model medsam
```

- Whole-dataset / multi-GPU: Edit the variables at the top of `infer/run_batch_inference.sh`:
  - `MODEL_PATH` (local Qwen checkpoint or `gpt`)
  - `SEG_MODEL` (`medsam`, `sam`, `imisnet`), plus the segmentation checkpoints/configs
  - `DATA_ROOT`, `DATASETS`, `SPLIT`
  - GPU topology (`N_GPUS`, `PROCESSES_PER_GPU`)
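As a sketch only (every path below is a placeholder, and the segmentation checkpoint/config variables are omitted because their exact names live in the script itself), the edited header might look like:

```shell
# Illustrative values only -- replace the paths with your own.
MODEL_PATH=/path/to/mllm_model   # local Qwen checkpoint, or "gpt"
SEG_MODEL=medsam                 # one of: medsam, sam, imisnet
DATA_ROOT=/path/to/data
DATASETS="BTCV"                  # hypothetical dataset selection
SPLIT=test
N_GPUS=8
PROCESSES_PER_GPU=1
```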
```bash
bash run_batch_inference.sh
```

Please follow the instructions in `RL-verl/README.md` to set up the Verl environment.
Note: this setup requires Sglang==0.5.4.
We support two segmentation backbones for RL training: MedSAM2 and IMISNet.
First, start the API server for segmentation model inference. You can choose either MedSAM2 or IMISNet by modifying the variables in RL-verl/api_server/run_api.sh:
```bash
bash RL-verl/api_server/run_api.sh
```
You can modify the following variables in `run.sh`:

- `MODEL`: segmentation backbone, options: `medsam2` or `imisnet`
- `SAVE_CHECKPOINT_DIR`: root directory to save Verl training outputs
- `DATASET_TRAIN`: path to the training dataset parquet file
- `DATASET_VAL`: path to the validation dataset parquet file
- `REF_MODEL_PATH`: path to the base MLLM model (local checkpoint or `Qwen/Qwen3-VL-8B-Instruct`)
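A sketch of these settings (all paths are placeholders):

```shell
# Illustrative values only.
MODEL=medsam2                              # or: imisnet
SAVE_CHECKPOINT_DIR=/path/to/verl_outputs
DATASET_TRAIN=/path/to/rl_train.parquet
DATASET_VAL=/path/to/rl_val.parquet
REF_MODEL_PATH=Qwen/Qwen3-VL-8B-Instruct   # or a local checkpoint path
```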
```bash
bash RL-verl/recipe/medsam_agent/run.sh
```

We greatly appreciate the tremendous effort behind the following projects!
If you find this work helpful for your project, please consider citing our paper.
@misc{liu2026medsamagentempoweringinteractivemedical,
title={MedSAM-Agent: Empowering Interactive Medical Image Segmentation with Multi-turn Agentic Reinforcement Learning},
author={Shengyuan Liu and Liuxin Bao and Qi Yang and Wanting Geng and Boyun Zheng and Chenxin Li and Wenting Chen and Houwen Peng and Yixuan Yuan},
year={2026},
eprint={2602.03320},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.03320},
}
