Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks; however, effectively integrating image segmentation into these models remains a significant challenge. In this paper, we introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. This unified representation allows seamless integration into the auto-regressive training pipeline of MLLMs for easier optimization. We demonstrate that representing an image with $16\times16$ semantic descriptors yields competitive segmentation performance. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Extensive experiments across various vision tasks, such as referring expression segmentation and comprehension, show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones. Our approach provides an efficient, scalable solution for vision-centric tasks within the MLLM framework.
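To make the text-as-mask idea concrete, the snippet below is a minimal, self-contained sketch of row-wise run-length encoding over a grid of per-patch text labels. It is only an illustration of the concept; the `label *count` notation, the separators, and the function names are our own and not the exact tokenization used by Text4Seg.

```python
# Minimal sketch (not the repository's implementation) of Row-wise Run-Length
# Encoding (R-RLE): each row of a 16x16 grid of per-patch text labels
# ("semantic descriptors") is compressed independently by merging consecutive
# repeats into "<label> *<count>" runs. The notation here is illustrative.

def rrle_encode(descriptor_grid):
    """Compress a 2D grid of text labels row by row."""
    rows = []
    for row in descriptor_grid:
        runs, prev, count = [], row[0], 1
        for label in row[1:]:
            if label == prev:
                count += 1
            else:
                runs.append(f"{prev} *{count}")
                prev, count = label, 1
        runs.append(f"{prev} *{count}")
        rows.append(", ".join(runs))
    return " | ".join(rows)  # "|" marks a row boundary


def rrle_decode(encoded, row_len=16):
    """Invert rrle_encode back to the full grid of labels."""
    grid = []
    for row in encoded.split(" | "):
        labels = []
        for run in row.split(", "):
            label, count = run.rsplit(" *", 1)
            labels.extend([label] * int(count))
        assert len(labels) == row_len
        grid.append(labels)
    return grid


# Example: one 16-patch row that is mostly background with a horse in the middle.
row = ["others"] * 5 + ["horse"] * 7 + ["others"] * 4
print(rrle_encode([row]))   # others *5, horse *7, others *4
assert rrle_decode(rrle_encode([row]))[0] == row
```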
- 09/2025: We release the extended version Text4Seg++.
- 12/2024: We release the code and datasets.
# git clone this repository
git clone https://github.com/mc-lan/Text4Seg.git
cd Text4Seg
# create new anaconda env
conda create -n text4seg python=3.10
conda activate text4seg
# install torch and dependencies
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
For the experiments based on ms-swift (Tables 1, 2, and 3), please check out Text4Seg/ms-swift.
- llava_v1_5_mix665k.json
- COCO: train2017
- GQA: images
- OCR-VQA: download script
- TextVQA: train_val_images
- VisualGenome: part1, part2
- Referring expression segmentation dataset
- Semantic segmentation dataset
Download them from the above links, and organize them as follows.
├── playground/data
│   ├── refer_seg
│   │   ├── grefcoco
│   │   │   └── grefs(unc).json
│   │   ├── images
│   │   │   ├── coco_2014
│   │   │   └── saiapr_tc-12
│   │   ├── refclef
│   │   │   └── instances.json
│   │   ├── refcoco
│   │   │   ├── instances.json
│   │   │   └── ...
│   │   ├── refcoco+
│   │   │   ├── instances.json
│   │   │   └── ...
│   │   └── refcocog
│   │       ├── instances.json
│   │       └── ...
│   ├── semantic_seg
│   │   ├── ADE20K
│   │   │   ├── annotations
│   │   │   └── images
│   │   ├── PAS20
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClass
│   │   │   └── val.txt
│   │   └── PC59
│   │       ├── JPEGImages
│   │       ├── SegmentationClassContext
│   │       └── pascalcontext_val.txt
│   ├── coco
│   │   └── train2017
│   ├── cocostuff
│   │   └── annotations
│   │       ├── 000000000009_labelTrainIds.png
│   │       └── ...
│   ├── gqa
│   │   └── images
│   ├── ocr_vqa
│   │   └── images
│   ├── textvqa
│   │   └── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   └── VG_100K_2
│   └── llava_v1_5_mix665k.json
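Optionally, a quick sanity check (our own sketch, not a script shipped with the repository) can confirm that the key paths from the listing above are in place before training:

```python
# Sketch: verify that the expected dataset layout (see the tree above) exists.
# Paths mirror the listing; adjust if your layout differs.
from pathlib import Path

ROOT = Path("playground/data")
expected = [
    "refer_seg/images/coco_2014",
    "refer_seg/refcoco/instances.json",
    "semantic_seg/ADE20K/images",
    "coco/train2017",
    "cocostuff/annotations",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "llava_v1_5_mix665k.json",
]

missing = [p for p in expected if not (ROOT / p).exists()]
print("All expected paths found." if not missing else f"Missing: {missing}")
```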
To evaluate the VQA performance, you need to prepare the evaluation datasets. Please download the eval.zip and extract its contents to ./playground/data/eval.
Generate the json files:
python playground/data/create_json/create_refcoco.py
python playground/data/create_json/create_grefercoco.py
python playground/data/create_json/create_cocostuff.py
Note that you don't need to manually download these pre-trained weights if you're only performing quick inference.
Download the clip-vit-large and vicuna-7b-v1.5 weights from Hugging Face to the pre_trained folder.
Download the mm_project.bin to the checkpoints/llava-v1.5-7b-pretrain folder.
Download the sam-h weights to the llava/model/segment_anything folder.
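If you prefer to fetch the backbone weights programmatically, the sketch below uses `huggingface_hub.snapshot_download`. The repo ids are assumptions based on the standard LLaVA-v1.5 setup (CLIP ViT-L/14-336 and Vicuna-7B v1.5); substitute the exact checkpoints you use.

```python
# Sketch: download the backbone weights into the folders referenced above.
# The repo ids are assumptions (standard LLaVA-v1.5 backbones); adjust as needed.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="openai/clip-vit-large-patch14-336",
                  local_dir="pre_trained/clip-vit-large-patch14-336")
snapshot_download(repo_id="lmsys/vicuna-7b-v1.5",
                  local_dir="pre_trained/vicuna-7b-v1.5")
```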
Please note that this checkpoint was trained on a combination of the LLaVA v1.5 mix665k dataset, the RefCOCO series, the GrefCOCO dataset, and the COCOStuff dataset for demonstration purposes.
python llava/eval/run_llava.py --model-path="lmc22/text4seg-llava-7b-p24" --image-file="images/horses.jpg" --query="Please segment the white horse in this image."
It will automatically download the checkpoint from Hugging Face.
Download our checkpoints (Table 4) from OneDrive to the checkpoints folder.
Referring expression segmentation:
bash scripts/v1_5/eval/refer_seg.sh
Open-vocabulary semantic segmentation:
bash scripts/v1_5/eval/semantic_seg.sh
Fine-tuning:
bash scripts/v1_5/fintune_lora.sh
Evaluation:
bash scripts/v1_5/eval/refer_seg.sh
@misc{lan2024text4segreimaginingimagesegmentation,
title={Text4Seg: Reimagining Image Segmentation as Text Generation},
author={Mengcheng Lan and Chaofeng Chen and Yue Zhou and Jiaxing Xu and Yiping Ke and Xinjiang Wang and Litong Feng and Wayne Zhang},
year={2024},
eprint={2410.09855},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.09855},
}
This project is licensed under the NTU S-Lab License 1.0. Redistribution and use should follow this license.
This study is supported under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).
This implementation is based on LLaVA and ms-swift. Thanks for their awesome work.
If you have any questions, please feel free to reach out at [email protected].
