InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

ICLR 2024
1Peking University, 2UC Berkeley, 3UCSF, 4Broad Institute of MIT & Harvard

We build a unified language interface for diverse vision tasks; our model handles image styles, categories, and instructions not seen during training.


Abstract

Recent advances in generative diffusion models have enabled text-controlled synthesis of realistic and diverse images with impressive quality. Despite these remarkable advances, the application of text-to-image generative models in computer vision for standard visual recognition tasks remains limited. The current de facto approach for these tasks is to design model architectures and loss functions that are tailored to the task at hand. In this paper, we develop a unified language interface for computer vision tasks that abstracts away task-specific design choices and enables task execution by following natural language instructions. Our approach involves casting multiple computer vision tasks as text-to-image generation problems. Here, the text represents an instruction describing the task, and the resulting image is a visually-encoded task output. To train our model, we pool commonly-used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification. We then use a large language model to paraphrase prompt templates that convey the specific tasks to be conducted on each image, and through this process, we create a multi-modal and multi-task training dataset comprising input and output images along with annotated instructions. Following the InstructPix2Pix architecture, we apply instruction-tuning to a text-to-image diffusion model using our constructed dataset, steering its functionality from a generative model to an instruction-guided multi-task vision learner. Experiments demonstrate that our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models. Moreover, it exhibits compelling generalization capabilities to unseen data, categories, and user instructions.

Video


Problem Overview

Current generalist vision models typically unify only the encoder, requiring distinct decoders for different tasks and explicit code to specify which vision task to perform. Our aim is to establish a unified language interface that bridges user instructions and diverse visual tasks: we feed natural-language instructions to the model, and it infers our intent and executes the corresponding vision task. To achieve this, we treat distinct vision tasks, such as image classification, semantic segmentation, object detection, and depth estimation, as image-generation problems, and we use instruction tuning to fine-tune a Stable Diffusion model, yielding a unified language interface that translates natural-language directives into a range of visual outputs.
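For illustration, the snippet below pairs each task with one example instruction and the kind of visually-encoded output the model is expected to generate. The templates and output encodings are illustrative assumptions, not the exact prompts or color conventions used in the paper.

# Illustrative only: one example instruction per task and the visually-encoded
# output the model produces. Exact templates and encodings may differ from the paper.
TASKS = {
    "semantic segmentation": {
        "instruction": "Segment the cat.",
        "output": "an image with the cat's pixels highlighted",
    },
    "object detection": {
        "instruction": "Detect the cars in this image.",
        "output": "the input image with bounding boxes drawn around the cars",
    },
    "depth estimation": {
        "instruction": "Estimate the depth of this image.",
        "output": "a grayscale image encoding per-pixel depth",
    },
    "classification": {
        "instruction": "Show red if there is a bird.",
        "output": "an image tinted red if a bird is present",
    },
}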


InstructCV Training Pipeline

LLM-Based Instruction Generation

Humans use many different expressions to convey the same meaning. To ensure that the model fully understands the instructions users type, it is crucial to build linguistic diversity into the text instructions. To this end, we employ a T5 model fine-tuned on a text-rewriting dataset and use it to paraphrase the prompt templates, enriching the linguistic diversity of the instructions.
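A minimal sketch of this paraphrasing step is shown below, assuming a Hugging Face seq2seq checkpoint. The checkpoint name and the "paraphrase:" prefix are placeholders; the paper uses a T5 model fine-tuned on a text-rewriting dataset, whose exact interface may differ.

# Minimal sketch: sampling rewrites of a prompt template with a T5 model.
# "t5-base" is a placeholder; swap in a rewriting-tuned checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def paraphrase(template: str, num_variants: int = 4) -> list[str]:
    """Sample several rewrites of one prompt template."""
    inputs = tokenizer("paraphrase: " + template, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        num_return_sequences=num_variants,
        max_new_tokens=32,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# e.g. paraphrase("Segment the {category}.") might yield variants such as
# "Please segment the {category}." or "Mark every pixel of the {category}."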


Multi-modal & Multi-task Training Data Construction

We create a multi-modal, multi-task training dataset by pairing the generated instructions with input images and their visually-encoded labels. In this way, existing standard datasets are transformed into the format we need. We build the training set from four datasets, Oxford-IIIT Pets, MS-COCO, ADE20K, and NYUv2, converting each into our desired format of paired images and text instructions.
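The sketch below shows how one such training record might be assembled. The render_target() helper is a hypothetical stand-in for the task-specific label-to-image encodings (a colored segmentation mask, a grayscale depth map, boxes drawn on the input, and so on), not the paper's exact implementation.

# Minimal sketch: assembling one (input image, instruction, target image) record.
import random

def build_record(image_path, task, label, instruction_variants, render_target):
    """Pair an image and its label with one paraphrased instruction."""
    instruction = random.choice(instruction_variants)   # inject linguistic diversity
    target_path = render_target(task, label)            # visually-encoded output image
    return {
        "input_image": str(image_path),
        "instruction": instruction,
        "target_image": str(target_path),
    }

# Repeating this over Oxford-IIIT Pets, MS-COCO, ADE20K, and NYUv2 yields the
# multi-modal, multi-task training set.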


Instruction-Tuning a Text-to-Image Diffusion Model

We fine-tune Stable Diffusion on our constructed multi-modal, multi-task dataset, transforming it into a unified language interface. The dataset comprises roughly 0.5 million image-text pairs, and we train for 10 epochs on eight A100 GPUs. For the detailed modifications to the model, please refer to our paper.
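As a usage sketch, once the instruction-tuned weights are exported in InstructPix2Pix format they can be loaded with the diffusers pipeline as below. The checkpoint path is a placeholder and the sampling settings are illustrative, not the paper's exact configuration.

# Minimal inference sketch with the InstructPix2Pix pipeline from diffusers.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "path/to/instructcv-weights",   # placeholder for the instruction-tuned checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("street.jpg").convert("RGB")
result = pipe(
    prompt="Estimate the depth of this image.",  # natural-language task instruction
    image=image,
    num_inference_steps=50,
    image_guidance_scale=1.5,
).images[0]
result.save("depth_prediction.png")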



Visualization Results



Future Directions

Compositional Instructions. As more tasks and user instructions are introduced, composing instructions offers a potential route to zero-shot task abilities. For example, once the model has learned classification and semantic segmentation, new tasks such as instance or panoptic segmentation could be achieved by combining instructions like "Show red if there is a bird" (a classification instruction) and "Segment the cat" (a segmentation instruction) into "Segment the cats, color them in blue and red".

Faster Inference Speed. The inference speed of our model lags behind specialized task-specific models and falls short of meeting the real-time inference requirements for tasks such as object detection and segmentation.

Learn from Human Feedback. The semantic flexibility of InstructCV is constrained by the richness and diversity of our instruction-tuning dataset, which is currently generated by rephrasing a limited set of template prompts. This raises questions for future work: can this learning paradigm accommodate instructions that introduce more nuanced conditions? For example, an instruction might cap the count of objects to be detected. Exploring such ideas might require the integration of strategies such as learning from human feedback, which could enable more versatile generalist models by improving alignment of task outputs with more complex prompts.


BibTeX


@article{gan2023instructcv,
  title={InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists},
  author={Gan, Yulu and Park, Sungwoo and Schubert, Alexander and Philippakis, Anthony and Alaa, Ahmed},
  journal={International Conference on Learning Representations (ICLR)},
  year={2024}
}