Awesome-Open-Vocabulary-Semantic-Segmentation

If you find this project helpful, please consider giving it a star ⭐.

Open-Vocabulary Semantic Segmentation (mainly updated by @tbh3223)
Zero-Shot Semantic Segmentation (mainly updated by @tbh3223)
Referring Image Segmentation (mainly updated by @ghost-000)
- Fully-Supervised Methods
- Weakly-Supervised Methods
Open-Vocabulary Object Detection (mainly updated by @tbh3223)
Universal Segmentation and Related Work (mainly updated by @tbh3223)
Other Open-Vocabulary Related Work
Related Survey

Open-Vocabulary Semantic Segmentation

Fully-Supervised Open-Vocabulary Semantic Segmentation

The model is trained on fully-supervised semantic segmentation datasets with pixel-level annotations (e.g., COCO Stuff dataset).

[LSeg] | ICLR'22 | Language-driven Semantic Segmentation | [pdf] | [code]
[OpenSeg] | ECCV'22 | Scaling Open-vocabulary Image Segmentation with Image-level Labels | [pdf] | [code]
[Xu et al.] | ECCV'22 | A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model | [pdf] | [code]
[SegCLIP] | ICML'23 | SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[MaskCLIP] | ICML'23 | Open-Vocabulary Universal Image Segmentation with MaskCLIP | [pdf] | [code]
[OVSeg] | CVPR'23 | Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP | [pdf] | [code]
[X-Decoder] | CVPR'23 | Generalized Decoding for Pixel, Image, and Language | [pdf] | [code]
[SAN] | CVPR'23(Highlight) | Side Adapter Network for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[SAN] | TPAMI'23 | SAN: Side Adapter Network for Open-vocabulary Semantic Segmentation | [pdf] | [code]
[ODISE] | CVPR'23 | Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models | [pdf] | [code]
[FreeSeg] | CVPR'23 | FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation | [pdf] | [code]
[OpenSeeD] | ICCV'23 | A Simple Framework for Open-Vocabulary Segmentation and Detection | [pdf] | [code]
[GKC] | ICCV'23 | Global Knowledge Calibration for Fast Open-Vocabulary Segmentation | [pdf]
[OPSNet] | ICCV'23 | Open-vocabulary Panoptic Segmentation with Embedding Modulation | [pdf] | [code]
[MasQCLIP] | ICCV'23 | MasQCLIP for Open-Vocabulary Universal Image Segmentation | [pdf]
[DeOP] | ICCV'23 | Open Vocabulary Semantic Segmentation with Decoupled One-Pass Network | [pdf] | [code]
[Li et al.] | ICCV'23 | Open-vocabulary Object Segmentation with Diffusion Models | [pdf] | [code]
[HIPIE] | NeurIPS'23 | Hierarchical Open-vocabulary Universal Image Segmentation | [pdf] | [code]
[FC-CLIP] | NeurIPS'23 | Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP | [pdf] | [code]
[MAFT] | NeurIPS'23 | Learning Mask-aware CLIP Representations for Zero-Shot Segmentation | [pdf] | [code]
[ADA] | NeurIPS'23 | Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation | [pdf]
[Dao et al.] | TMM | Class Enhancement Losses with Pseudo Labels for Open-Vocabulary Semantic Segmentation | [pdf]
[SELF-SEG] | ArXiv'23.12 | Self-Guided Open-Vocabulary Semantic Segmentation | [pdf]
[OpenSD] | ArXiv'23.12 | OpenSD: Unified Open-Vocabulary Segmentation and Detection | [pdf] | [code]
[SILC] | ArXiv'23.12 | SILC: Improving Vision Language Pretraining with Self-Distillation | [pdf]
[CLIPSelf] | ICLR'24(Spotlight) | CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction | [pdf] | [code]
[RENOVATE] | ArXiv'24.03 | Renovating Names in Open-Vocabulary Segmentation Benchmarks | [pdf]
[DreamLIP] | ECCV'24 | DreamLIP: Language-Image Pre-training with Long Captions | [pdf] | [code]
[CAT-Seg] | CVPR'24 | CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[SED] | CVPR'24 | SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[SCAN] | CVPR'24 | Open-Vocabulary Segmentation with Semantic-Assisted Calibration | [pdf] | [code]
[OpenTrans] | CVPR'24 | Transferable and Principled Efficiency for Open-Vocabulary Segmentation | [pdf] | [code]
[H-CLIP] | ArXiv'24.05 | Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation | [pdf]
[OpenDAS] | ArXiv'24.05 | OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation | [pdf]
[USE] | CVPR'24 | USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation | [pdf]
[EBSeg] | CVPR'24 | Open-Vocabulary Semantic Segmentation with Image Embedding Balancing | [pdf] | [code]
[MAFT+] | ECCV'24 | Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation | [pdf] | [code]
[R-Adapter] | ECCV'24 | Efficient and Versatile Robust Fine-Tuning of Zero-shot Models | [pdf] | [code]
[MROVSeg] | ArXiv'24.08 | MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation | [pdf]
[FrozenSeg] | ArXiv'24.09 | FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation | [pdf] | [code]
[GBA] | ArXiv'24.09 | Generalization Boosted Adapter for Open-Vocabulary Segmentation | [pdf]
[SMART] | ArXiv'24.09 | Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation | [pdf]
[ESC-Net] | ArXiv'24.11 | Effective SAM Combination for Open-Vocabulary Semantic Segmentation | [pdf]
[Mask-Adapter] | CVPR'25 | Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation | [pdf] | [code]
[ERR-Seg] | ArXiv'25.01 | Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[EOV-Seg] | AAAI'25 | EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation | [pdf] | [code]
[SemLA] | CVPR'25 | Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation | [pdf] | [code] | (Note: new benchmark.)
[FGA-Seg] | ArXiv'25.01 | FGA-Seg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[OMTSeg] | ICIP'24 | Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model | [pdf] | [code]
[MaskCLIP++] | ArXiv'25.03 | MaskCLIP++: High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation | [pdf] | [code]
[OVSNet] | ICCV'25 | Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation | [pdf]
[R-SC-CLIPSelf] | ICLR'25 | Refining CLIP's Spatial Awareness: A Visual-Centric Perspective | [pdf]
[OpenWorldSAM] | NeurIPS'25(Spotlight) | OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts | [pdf]
[Spectrum] | AAAI'26 | Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts | [pdf] | [project]
[VocAlign] | BMVC'25 | Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[SAM-MI] | ArXiv'25.11 | SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM | [pdf]
[X-Agent] | ACM MM'25 | Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[Personalized OVSS] | ICCV'25 | Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation | [pdf]

Weakly-Supervised Open-Vocabulary Semantic Segmentation

[text-supervised/language-supervised] The model is trained on weakly supervised datasets with only image-level annotations/captions (e.g., CC12M dataset).

[GroupViT] | CVPR'22 | GroupViT: Semantic Segmentation Emerges from Text Supervision | [pdf] | [code]
[ViL-Seg] | ECCV'22 | Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding | [pdf]
[MaskCLIP+] | ECCV'22(Oral) | Extract Free Dense Labels from CLIP | [pdf] | [code]
[ViewCo] | ICLR'23 | ViewCo: Discovering Text-supervised Segmentation Masks via Multi-view Semantic Consistency | [pdf]
[SegCLIP] | ICML'23 | SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[CLIP-S4] | CVPR'23 | CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation | [pdf]
[PACL] | CVPR'23 | Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning | [pdf]
[OVSegmentor] | CVPR'23 | Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision | [pdf] | [code]
[SimSeg] | CVPR'23 | A Simple Framework for Text-Supervised Semantic Segmentation | [pdf] | [code]
[TCL] | CVPR'23 | Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs | [pdf] | [code]
[SimCon] | ArXiv'23.02 | SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation | [pdf]
[Zhang et al.] | ArXiv'23.04 | Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation | [pdf]
[ZeroSeg] | ICCV'23 | Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only | [pdf]
[CLIPpy] | ICCV'23 | Perceptual Grouping in Contrastive Vision-Language Models | [pdf]
[MixReorg] | ICCV'23 | MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation | [pdf]
[CoCu] | NeurIPS'23 | Bridging Semantic Gaps for Language-Supervised Semantic Segmentation | [pdf] | [code]
[PGSeg] | NeurIPS'23 | Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[SAM-CLIP] | ArXiv'23.10 | SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding | [pdf]
[CLIP-DINOiser] | ArXiv'23.12 | CLIP-DINOiser: Teaching CLIP a few DINO tricks | [pdf] | [code]
[TagAlign] | ArXiv'23.12 | TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification | [pdf] | [code]
[S-Seg] | ArXiv'24.01 | Exploring Simple Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[CLIPSelf] | ICLR'24(Spotlight) | CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction | [pdf] | [code]
[Uni-OVSeg] | ArXiv'24.02 | Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision | [pdf] | [code]
[MGCA] | ArXiv'24.03 | Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision | [pdf]
[TTD] | ArXiv'24.04 | TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias | [pdf] | [code]
[CoDe] | CVPR'24 | Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation | [pdf]
[LLM-Supervision] | ArXiv'24.03 | Training-Free Semantic Segmentation via LLM-Supervision | [pdf]
[ProxyCLIP] | ECCV'24 | ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation | [pdf] | [code]
[LPOSS] | CVPR'25 | LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation | [pdf] | [code]
[SynSeg] | AAAI'26 | SynSeg: Feature Synergy for Multi-Category Contrastive Learning in Open-Vocabulary Semantic Segmentation | [pdf]
[RF-CLIP] | AAAI'26 | Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective | [pdf] | [code]

Training-Free Open-Vocabulary Semantic Segmentation

The model is built on off-the-shelf large pretrained models (e.g., CLIP, diffusion models) without any additional training phase. Note that these large models have themselves already been trained on some datasets (e.g., image-caption datasets).
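
As a rough illustration of what "without an additional training phase" can mean in practice, the sketch below labels each image patch with the class whose text embedding is most similar to it. It is the plainest possible baseline and not the method of any paper listed here. The function name `dense_labels` and the toy embeddings are hypothetical; real inputs would come from a frozen vision-language model such as CLIP.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dense_labels(patch_feats, text_feats):
    """Assign each patch the index of its most similar class prompt.

    patch_feats: H x W grid (nested lists) of D-dim patch embeddings.
    text_feats:  list of C class-prompt embeddings, each D-dim.
    Returns an H x W grid of class indices. No parameters are trained.
    """
    text = [_normalize(t) for t in text_feats]
    labels = []
    for row in patch_feats:
        out_row = []
        for patch in row:
            p = _normalize(patch)
            # Cosine similarity between this patch and every class prompt.
            sims = [sum(a * b for a, b in zip(p, t)) for t in text]
            out_row.append(max(range(len(sims)), key=sims.__getitem__))
        labels.append(out_row)
    return labels


# Toy stand-ins for frozen-encoder outputs (hypothetical values).
TEXT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
PATCHES = [
    [[2, 0.1, 0, 0], [0, 3, 0.5, 0]],
    [[0.2, 0, 1, 0], [1, 0.9, 0, 0]],
]
print(dense_labels(PATCHES, TEXT))
```

Many of the entries below can be read as ways of getting better patch features or better similarity maps out of the frozen model than this naive matching provides.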

[MaskCLIP] | ECCV'22(Oral) | Extract Free Dense Labels from CLIP | [pdf] | [code]
[ReCo] | NeurIPS'22 | ReCo: Retrieve and Co-segment for Zero-shot Transfer | [pdf] | [code]
[CLIP Surgery] | ArXiv'23.04 | CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks | [pdf] | [code]
[OVDiff] | ArXiv'23.06 | Diffusion Models for Zero-Shot Open-Vocabulary Segmentation | [pdf]
[DiffSegmenter] | ArXiv'23.09 | Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter | [pdf] | [code]
[IPSeg] | IJCV'24 | Towards Training-free Open-world Segmentation via Image Prompting Foundation Models | [pdf]
[SCLIP] | ArXiv'23.12 | SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference | [pdf]
[GEM] | CVPR'24 | Grounding Everything: Emerging Localization Properties in Vision-Language Transformers | [pdf] | [code]
[CLIP-DIY] | WACV'24 | CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free | [pdf]
[FOSSIL] | WACV'24 | FOSSIL: Free Open-Vocabulary Semantic Segmentation through Synthetic References Retrieval | [pdf]
[TagCLIP] | AAAI'24 | TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training | [pdf] | [code]
[EmerDiff] | ICLR'24 | EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models | [pdf] | [code]
[FreeSeg-Diff] | ArXiv'24.03 | FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models | [pdf] | [code]
[MaskDiffusion] | ArXiv'24.03 | MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation | [pdf] | [code]
[TAG] | ArXiv'24.03 | TAG: Guidance-free Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[Sun et al.] | ArXiv'24.04 | Training-Free Semantic Segmentation via LLM-Supervision | [pdf]
[NACLIP] | ArXiv'24.04 | Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[PnP-OVSS] | CVPR'24 | Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models | [pdf] | [code]
[CaR] | CVPR'24 | CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor | [pdf] | [code]
[Wang et al.] | CVPR'24 | Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[FreeDA] | CVPR'24 | Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation | [pdf] | [code]
[Yang et al.] | ArXiv'24.05 | Tuning-free Universally-Supervised Semantic Segmentation | [pdf]
[CLIPTrase] | ECCV'24 | Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation | [pdf] | [code]
[ClearCLIP] | ECCV'24 | ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference | [pdf] | [code]
[ProxyCLIP] | ECCV'24 | ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation | [pdf] | [code]
[LaVG] | ECCV'24 | In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[ITACLIP] | ArXiv'24.11 | ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements | [pdf] | [code]
[Trident] | ArXiv'24.11 | Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation | [pdf] | [code]
[CorrCLIP] | ArXiv'24.11 | CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation | [pdf]
[CLIPer] | ArXiv'24.11 | CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[ResCLIP] | ArXiv'24.11 | ResCLIP: Residual Attention for Training-free Dense Vision-language Inference | [pdf] | [code]
[SC-CLIP] | ArXiv'24.11 | Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation | [pdf] | [code]
[Talk2DINO] | ArXiv'24.11 | Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation | [pdf] | [code]
[CASS] | CVPR'25 | Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[ReME] | ICCV'25 | ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation | [pdf] | [code]
[SFP] | ICCV'25 | Feature Purification Matters: Suppressing Outlier Propagation for Training-Free Open-Vocabulary Semantic Segmentation | [pdf] | [code]
[FSA] | ICCV'25 | Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation | [pdf] | [code]
[FreeCP] | ICCV'25 | Training-Free Class Purification for Open-Vocabulary Semantic Segmentation | [pdf] | [code]

Others

[EntitySeg] | ArXiv'23.11 | Rethinking Evaluation Metrics of Open-Vocabulary Segmentation | [pdf] | [code]
[PixelCLIP] | NeurIPS'24 | Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels | [pdf] | [code]

Zero-Shot Semantic Segmentation

Unlike open-vocabulary segmentation, which is evaluated across datasets, zero-shot methods split the classes of each dataset into seen and unseen classes.
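
A minimal sketch of such a seen/unseen split is shown below. It assumes one common protocol in which pixels of unseen classes are replaced by an ignore label in the training annotations; individual papers may handle unseen pixels differently. The helper names, the class list, and the ignore value 255 are illustrative assumptions, not taken from any listed work.

```python
IGNORE_INDEX = 255  # assumed ignore label; a common convention, not from this list


def split_classes(class_names, unseen_names):
    """Return (seen_ids, unseen_ids) as index lists into class_names."""
    unseen = set(unseen_names)
    seen_ids = [i for i, name in enumerate(class_names) if name not in unseen]
    unseen_ids = [i for i, name in enumerate(class_names) if name in unseen]
    return seen_ids, unseen_ids


def mask_unseen(label_map, unseen_ids, ignore_index=IGNORE_INDEX):
    """Replace unseen-class pixels in a training label map with ignore_index."""
    unseen = set(unseen_ids)
    return [[ignore_index if c in unseen else c for c in row] for row in label_map]


# Hypothetical four-class dataset with two classes held out as unseen.
classes = ["cat", "dog", "grass", "sky"]
seen_ids, unseen_ids = split_classes(classes, ["dog", "sky"])
train_labels = mask_unseen([[0, 1], [2, 3]], unseen_ids)
print(seen_ids, unseen_ids, train_labels)
```

At test time the original label map is used, so the unseen classes are evaluated even though they were never supervised during training.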

[ZegFormer] | CVPR'22 | ZegFormer: Decoupling Zero-Shot Semantic Segmentation | [pdf] | [code]
[Xu et al.] | ECCV'22 | A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model | [pdf] | [code]
[ZegCLIP] | CVPR'23 | ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation | [pdf] | [code]
[PADing] | CVPR'23 | Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation | [pdf] | [code]
[DeOP] | ICCV'23 | Open Vocabulary Semantic Segmentation with Decoupled One-Pass Network | [pdf] | [code]
[SPT] | AAAI'24 | Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation | [pdf] | [code]
[Chen et al.] | ArXiv'24.02 | Generalizable Semantic Vision Query Generation for Zero-shot Panoptic and Semantic Segmentation | [pdf]
[LDVC] | ArXiv'24.03 | Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation | [pdf]
[OTSeg] | ArXiv'24.03 | OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation | [pdf]
[Cascade-CLIP] | ICML'24 | Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation | [pdf] | [code]
[SimZSS] | ArXiv'24.07 | A Simple Framework for Open-Vocabulary Zero-Shot Segmentation | [pdf]
[CaR] | CVPR'24 | CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor | [pdf] | [code]

Referring Image Segmentation

Fully-Supervised Referring Image Segmentation

[CARIS] | ACM MM'23 | CARIS: Context-Aware Referring Image Segmentation | [pdf] | [code]
[BKINet] | TMM'23 | Bilateral Knowledge Interaction Network for Referring Image Segmentation | [pdf] | [code]
[Group-RES] | ICCV'23 | Advancing Referring Expression Segmentation Beyond Single Image | [pdf] | [code]
[RIS-DMMI] | ICCV'23 | Beyond One-to-One: Rethinking the Referring Image Segmentation | [pdf] | [code]
[ETRIS] | ICCV'23 | Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation | [pdf] | [code]
[SEEM] | ArXiv'23.04 | Segment Everything Everywhere All at Once | [pdf] | [code]

Weakly-Supervised Referring Image Segmentation

[Strudel et al.] | ArXiv'22.05 | Weakly-supervised segmentation of referring expressions | [pdf]
[Kim et al.] | ICCV'23 | Shatter and Gather: Learning Referring Image Segmentation with Text Supervision | [pdf] | [code]
[TRIS] | ICCV'23 | Referring Image Segmentation Using Text Supervision | [pdf] | [code]
[Jungbeom Lee et al.] | ICCV'23 | Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency | [pdf]
[PPT] | CVPR'24 | Curriculum Point Prompting for Weakly-Supervised Referring Segmentation | [pdf]

Open-Vocabulary Object Detection

[RO-ViT] | CVPR'23(Highlight) | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | [pdf] | [code]
[CAT] | CVPR'23 | CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection | [pdf] | [code]
[DetCLIPv2] | CVPR'23 | DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment | [pdf]
[CondHead] | CVPR'23 | Learning to Detect and Segment for Open Vocabulary Object Detection | [pdf]
[CORA] | CVPR'23 | CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching | [pdf] | [code]
[ovdet] | CVPR'23 | Aligning Bag of Regions for Open-Vocabulary Object Detection | [pdf] | [code]
[OADP] | CVPR'23 | Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection | [pdf] | [code]
[F-VLM] | ICLR'23 | F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models | [pdf] | [code]
[mm-ovod] | ICML'23 | Multi-Modal Classifiers for Open-Vocabulary Object Detection | [pdf] | [code]
[SGDN] | ArXiv'23.07 | Open-Vocabulary Object Detection via Scene Graph Discovery | [pdf]
[MMC-Det] | ArXiv'23.08 | Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection | [pdf]
[SAS-Det] | CVPR'24 | Taming Self-Training for Open-Vocabulary Object Detection | [pdf] | [code]
[DITO] | ArXiv'23.09 | Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection | [pdf] | [code]
[EdaDet] | ICCV'23 | EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment | [pdf] | [code]
[LP-OVOD] | WACV'24 | LP-OVOD: Open-Vocabulary Object Detection by Linear Probing | [pdf] | [code]
[DST-Det] | ArXiv'23.10 | DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection | [pdf] | [code]
[CoDet] | NeurIPS'23 | CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection | [pdf] | [code]
[PLAC] | ArXiv'23.12 | Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection | [pdf]
[Sambor] | ArXiv'23.12 | Boosting Segment Anything Model Towards Open-Vocabulary Learning | [pdf] | [code]
[DVDet] | ICLR'24 | LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors | [pdf]
[DetCLIPv3] | CVPR'24 | DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection | [pdf]
[AggDet] | ArXiv'24.04 | Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation | [pdf]
[RALF] | CVPR'24 | Retrieval-Augmented Open-Vocabulary Object Detection | [pdf] | [code]
[Chhipa et al.] | ArXiv'24.06 | Investigating Robustness of Open-Vocabulary Foundation Object Detectors under Distribution Shifts | [pdf]
[SHiNe] | CVPR'24(Highlight) | SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection | [pdf] | [code]
[RTGen] | ArXiv'24.06 | RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection | [pdf] | [code]
[LBP] | CVPR'24 | Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection | [pdf]
[YOLO-World] | CVPR'24 | Real-Time Open-Vocabulary Object Detection | [pdf] | [code]
[OV-DINO] | ArXiv'24.07 | Unified Open-Vocabulary Detection with Language-Aware Selective Fusion | [pdf] | [code]
[OVLW-DETR] | ArXiv'24.07 | OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer | [pdf] | [code]
[LaMI-DETR] | ECCV'24 | LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction | [pdf] | [code]
[MarvelOVD] | ECCV'24 | MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection | [pdf] | [code]
[DetLH] | NeurIPS'24 | Open-Vocabulary Object Detection via Language Hierarchy | [pdf]
[CCKT-Det] | ICLR'25 | Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection | [pdf]
[HD-OVD] | TMM'25 | A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection | [pdf] | [code]
[LLMDet] | CVPR'25(Highlight) | LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models | [pdf] | [code]
[VMCNet] | ArXiv'25.03 | Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection | [pdf]
[Vireo] | ArXiv'25.06 | Vireo: Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation | [pdf]
[ATAS] | ArXiv'25.06 | ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction | [pdf]
[SSEP] | ArXiv'25.11 | State and Scene Enhanced Prototypes for Weakly Supervised Open-Vocabulary Object Detection | [pdf]

Universal Segmentation and Related Work

[Semantic-SAM] | ECCV'24 | Semantic-SAM: Segment and Recognize Anything at Any Granularity | [pdf] | [code]
[Open-Vocabulary SAM] | ECCV'24 | Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively | [pdf] | [code]
[OMG-Seg] | CVPR'24 | OMG-Seg: Is One Model Good Enough For All Segmentation? | [pdf] | [code]
[OMG-LLaVA] | NeurIPS'24 | OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | [pdf] | [code]
[PSALM] | ECCV'24 | PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model | [pdf] | [code]
[HyperSeg] | ArXiv'24.11 | HyperSeg: Towards Universal Visual Segmentation with Large Language Model | [pdf] | [code]
[SAMRefiner] | ICLR'25 | SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement | [pdf] | [code]

Other Open-Vocabulary Related Work

[DENOISER] | ArXiv'24.04 | DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition | [pdf]
[O2V-mapping] | ArXiv'24.04 | O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation | [pdf]
[CMD-SE] | CVPR'24 | Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection | [pdf]
[FG-CLIP] | CBMI'24 | Is CLIP the main roadblock for fine-grained open-world perception? | [pdf] | [code]
[NegPrompt] | CVPR'24 | Learning Transferable Negative Prompts for Out-of-Distribution Detection | [pdf] | [code]
[OVFoodSeg] | CVPR'24 | OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation | [pdf]
[Fed-MP] | NAACL'24 | Open-Vocabulary Federated Learning with Multimodal Prototyping | [pdf]
[PSALM] | ECCV'24 | PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model | [pdf] | [code]
[OVAM] | ArXiv'24.03 | Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models | [pdf]
[CLIP-VIS] | ArXiv'24.06 | CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation | [pdf]
[RoboHop] | ICRA'24 | RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation | [pdf]
[Rein] | CVPR'24 | Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation | [pdf] | [code]
[OVMR] | CVPR'24 | OVMR: Open-Vocabulary Recognition with Multi-Modal References | [pdf] | [code]
[PartCLIPSeg] | ArXiv'24.06 | Understanding Multi-Granularity for Open-Vocabulary Part Segmentation | [pdf] | [code]
[GBC] | ArXiv'24.07 | Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions | [pdf]
[TCC] | ArXiv'24.07 | A Study of Test-time Contrastive Concepts for Open-world, Open-vocabulary Semantic Segmentation | [pdf]
[OPS] | ECCV'24 | Open Panoramic Segmentation | [pdf] | [code]
[Yu et al.] | ArXiv'24.07 | PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction | [pdf]
[Oryon] | CVPR'24(Highlight) | Oryon: Open-Vocabulary Object 6D Pose Estimation | [pdf] | [code]
[GLIS] | ECCV'24 | Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection | [pdf] | [code]
[OVExp] | ArXiv'24.07 | OVExp: Open Vocabulary Exploration for Object-Oriented Navigation | [pdf] | [code]
[OV-MLVC] | ArXiv'24.07 | Open Vocabulary Multi-Label Video Classification | [pdf]
[DART] | ArXiv'24.07 | An automated end-to-end object detection pipeline with data Diversification, open-vocabulary bounding box Annotation, pseudo-label Review, and model Training | [pdf] | [code]
[NOVIC] | ArXiv'24.07 | Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion | [pdf]
[CerberusDet] | ArXiv'24.07 | CerberusDet: Unified Multi-Task Object Detection | [pdf]
[GGSD] | ArXiv'24.07 | Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation | [pdf] | [code]
[Diff2Scene] | ECCV'24 | Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models | [pdf]
[SegPoint] | ECCV'24 | SegPoint: Segment Any Point Cloud via Large Language Model | [pdf] | [code]
[LangOcc] | ArXiv'24.07 | LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering | [pdf]
[OVR] | ArXiv'24.07 | A Dataset for Open Vocabulary Temporal Repetition Counting in Videos | [pdf] | [code]
[SAM-CP] | ArXiv'24.07 | SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation | [pdf] | [code]
[OV-AVSS] | ACM MM'24(Oral) | Open-Vocabulary Audio-Visual Semantic Segmentation | [pdf] | [code]
[Open3DRF] | ArXiv'24.08 | Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space | [pdf] | [code]
[OVA-DETR] | ArXiv'24.08 | OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion | [pdf] | [code]
[OVAL] | ArXiv'24.08 | Open-vocabulary Temporal Action Localization using VLMs | [pdf] | [code]
[EMPOWER] | IROS'24 | EMPOWER: Embodied Multi-role Open-vocabulary Planning with Online Grounding and Execution | [pdf] | [code]
[AnytimeCL] | ECCV'24(Oral) | Anytime Continual Learning for Open Vocabulary Classification | [pdf] | [code]
[OWL] | IJCV'24 | Lidar Panoptic Segmentation in an Open World | [pdf] | [code]
[DWI] | ArXiv'24.10 | Overcoming Domain Limitations in Open-vocabulary Segmentation | [pdf] | [code]
[OVT-B-Dataset] | NeurIPS'24 | OVT-B: A New Large-Scale Benchmark for Open-Vocabulary Multi-Object Tracking | [pdf] | [code]
[OpenMixer] | WACV'25 | Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection | [pdf] | [code]
[Octree-Graph] | ArXiv'24.11 | Open-Vocabulary Octree-Graph for 3D Scene Understanding Segmentation | [pdf]
[Fun3DU] | ArXiv'24.11 | Functionality understanding and segmentation in 3D scenes | [pdf]
[MASA] | CVPR'24(Highlight) | Matching Anything by Segmenting Anything | [pdf] | [code]
[OVOW] | ArXiv'24.11 | From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects | [pdf] | [code]
[DINO-X] | ArXiv'24.11 | DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding | [pdf] | [code]
[CellSeg1] | ArXiv'24.12 | CellSeg1: Robust Cell Segmentation with One Training Image | [pdf]
[DB-SAM] | MICCAI'24(Oral) | DB-SAM: Delving into High Quality Universal Medical Image Segmentation | [pdf] | [code]
[Seg-TTO] | ArXiv'25.03 | Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation | [pdf] | [code]
[Chicken-and-egg] | ArXiv'25.02 | From Open-Vocabulary to Vocabulary-Free Semantic Segmentation | [pdf]
[Open-MeDe] | ArXiv'25.02 | Learning to Generalize without Bias for Open-Vocabulary Action Recognition | [pdf]
[Kang et al.] | ArXiv'25.03 | Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | [pdf]
[TRACT] | ArXiv'25.03 | Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking | [pdf]
[GSNet] | AAAI'25 | Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation | [pdf]
[ComCa] | CVPR'25 | Compositional Caching for Training-free Open-vocabulary Attribute Detection | [pdf] | [code]
[BOLqGLKOLuJGI. ] | ArXiv'25.03 | Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments | [pdf]
[PRISM-0] | ArXiv'25.04 | PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks | [pdf]
[AerOSeg] | CVPR'25 EarthVision workshop | AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images | [pdf]
[NVSMask3D] | SCIA'25 | NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation | [pdf]
[RESAnything] | ArXiv'25.05 | RESAnything: Attribute Prompting for Arbitrary Referring Segmentation | [pdf] | [code]

Related Survey

Towards Open Vocabulary Learning: A Survey | [pdf]
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future | [pdf]
Image Segmentation in Foundation Model Era: A Survey | [pdf]

Feedback

If you have any suggestions or find missing papers, please don't hesitate to contact me via [email protected] or [email protected].
