If you find this project helpful, please consider giving it a star ⭐.
- Open-Vocabulary Semantic Segmentation (mainly updated by @tbh3223)
- Zero-shot Semantic Segmentation (mainly updated by @tbh3223)
- Referring-Image-Segmentation (mainly updated by @ghost-000)
- Open-Vocabulary Object Detection (mainly updated by @tbh3223)
- Universal Segmentation and Related Work (mainly updated by @tbh3223)
The model is trained on fully-supervised semantic segmentation datasets with pixel-level annotations (e.g., COCO Stuff dataset).
- [LSeg] | ICLR'22 | Language-driven Semantic Segmentation | [pdf]|[code]
- [OpenSeg] | ECCV'22 | Scaling Open-vocabulary Image Segmentation with Image-level Labels | [pdf]|[code]
- [Xu et al.] | ECCV'22 | A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model | [pdf]|[code]
- [SegCLIP] | ICML'23 | SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [MaskCLIP] | ICML'23 | Open-Vocabulary Universal Image Segmentation with MaskCLIP | [pdf]|[code]
- [OVSeg] | CVPR'23 | Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP | [pdf]|[code]
- [X-Decoder] | CVPR'23 | Generalized Decoding for Pixel, Image, and Language | [pdf]|[code]
- [SAN] | CVPR'23(Highlight) | Side Adapter Network for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [SAN] | TPAMI'23 | SAN: Side Adapter Network for Open-vocabulary Semantic Segmentation | [pdf]|[code]
- [ODISE] | CVPR'23 | Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models | [pdf]|[code]
- [FreeSeg] | CVPR'23 | FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation | [pdf]|[code]
- [OpenSeeD] | ICCV'23 | A Simple Framework for Open-Vocabulary Segmentation and Detection | [pdf]|[code]
- [GKC] | ICCV'23 | Global Knowledge Calibration for Fast Open-Vocabulary Segmentation | [pdf]
- [OPSNet] | ICCV'23 | Open-vocabulary Panoptic Segmentation with Embedding Modulation | [pdf]|[code]
- [MasQCLIP] | ICCV'23 | MasQCLIP for Open-Vocabulary Universal Image Segmentation | [pdf]
- [DeOP] | ICCV'23 | Open Vocabulary Semantic Segmentation with Decoupled One-Pass Network | [pdf]|[code]
- [Li et al.] | ICCV'23 | Open-vocabulary Object Segmentation with Diffusion Models | [pdf]|[code]
- [HIPIE] | NeurIPS'23 | Hierarchical Open-vocabulary Universal Image Segmentation | [pdf]|[code]
- [FC-CLIP] | NeurIPS'23 | Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP | [pdf]|[code]
- [MAFT] | NeurIPS'23 | Learning Mask-aware CLIP Representations for Zero-Shot Segmentation | [pdf]|[code]
- [ADA] | NeurIPS'23 | Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation | [pdf]
- [Dao et al.] | TMM | Class Enhancement Losses with Pseudo Labels for Open-Vocabulary Semantic Segmentation | [pdf]
- [SELF-SEG] | ArXiv'23.12 | Self-Guided Open-Vocabulary Semantic Segmentation | [pdf]
- [OpenSD] | ArXiv'23.12 | OpenSD: Unified Open-Vocabulary Segmentation and Detection | [pdf]|[code]
- [SILC] | ArXiv'23.12 | SILC: Improving Vision Language Pretraining with Self-Distillation | [pdf]
- [CLIPSelf] | ICLR'24(Spotlight) | CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction | [pdf]|[code]
- [RENOVATE] | ArXiv'24.03 | Renovating Names in Open-Vocabulary Segmentation Benchmarks | [pdf]
- [DreamLIP] | ECCV'24 | DreamLIP: Language-Image Pre-training with Long Captions | [pdf]|[code]
- [CAT-Seg] | CVPR'24 | CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [SED] | CVPR'24 | SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [SCAN] | CVPR'24 | Open-Vocabulary Segmentation with Semantic-Assisted Calibration | [pdf]|[code]
- [OpenTrans] | CVPR'24 | Transferable and Principled Efficiency for Open-Vocabulary Segmentation | [pdf]|[code]
- [H-CLIP] | ArXiv'24.05 | Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation | [pdf]
- [OpenDAS] | ArXiv'24.05 | OpenDAS: Domain Adaptation for Open-Vocabulary Segmentation | [pdf]
- [USE] | CVPR'24 | USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation | [pdf]
- [EBSeg] | CVPR'24 | Open-Vocabulary Semantic Segmentation with Image Embedding Balancing | [pdf]|[code]
- [MAFT+] | ECCV'24 | Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation | [pdf]|[code]
- [R-Adapter] | ECCV'24 | Efficient and Versatile Robust Fine-Tuning of Zero-shot Models | [pdf]|[code]
- [MROVSeg] | ArXiv'24.08 | MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation | [pdf]
- [FrozenSeg] | ArXiv'24.09 | FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation | [pdf]|[code]
- [GBA] | ArXiv'24.09 | Generalization Boosted Adapter for Open-Vocabulary Segmentation | [pdf]
- [SMART] | ArXiv'24.09 | Semantic Refocused Tuning for Open-Vocabulary Panoptic Segmentation | [pdf]
- [ESC-Net] | ArXiv'24.11 | Effective SAM Combination for Open-Vocabulary Semantic Segmentation | [pdf]
- [Mask-Adapter] | CVPR'25 | Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation | [pdf]|[code]
- [ERR-Seg] | ArXiv'25.01 | Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [EOV-Seg] | AAAI'25 | EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation | [pdf]|[code]
- [SemLA] | CVPR'25 | Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation | [pdf]|[code]| (Note: new benchmark.)
- [FGA-Seg] | ArXiv'25.01 | FGA-Seg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [OMTSeg] | ICIP'24 | Open-Vocabulary Panoptic Segmentation Using BERT Pre-Training of Vision-Language Multiway Transformer Model | [pdf]|[code]
- [MaskCLIP++] | ArXiv'25.03 | MaskCLIP++: High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation | [pdf]|[code]
- [OVSNet] | ICCV'25 | Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation | [pdf]
- [R-SC-CLIPSelf] | ICLR'25 | Refining CLIP's Spatial Awareness: A Visual-Centric Perspective | [pdf]
- [OpenWorldSAM] | NeurIPS'25(Spotlight) | OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts | [pdf]
- [Spectrum] | AAAI'26 | Learning 3D Texture-Aware Representations for Parsing Diverse Human Clothing and Body Parts | [pdf]|[project]
- [VocAlign] | BMVC'25 | Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [SAM-MI] | ArXiv'25.11 | SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM | [pdf]
- [X-Agent] | ACM MM'25 | Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [Personalized OVSS] | ICCV'25 | Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation | [pdf]
[text-supervised/language-supervised] The model is trained on weakly supervised datasets with only image-level annotations/captions (e.g., CC12M dataset).
- [GroupViT] | CVPR'22 | GroupViT: Semantic Segmentation Emerges from Text Supervision | [pdf]|[code]
- [ViL-Seg] | ECCV'22 | Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding | [pdf]
- [MaskCLIP+] | ECCV'22(Oral) | Extract Free Dense Labels from CLIP | [pdf]|[code]
- [ViewCo] | ICLR'23 | ViewCo: Discovering Text-supervised Segmentation Masks via Multi-view Semantic Consistency | [pdf]
- [SegCLIP] | ICML'23 | SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [CLIP-S4] | CVPR'23 | CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation | [pdf]
- [PACL] | CVPR'23 | Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning | [pdf]
- [OVSegmentor] | CVPR'23 | Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision | [pdf]|[code]
- [SimSeg] | CVPR'23 | A Simple Framework for Text-Supervised Semantic Segmentation | [pdf]|[code]
- [TCL] | CVPR'23 | Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs | [pdf]|[code]
- [SimCon] | ArXiv'23.02 | SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation | [pdf]
- [Zhang et al.] | ArXiv'23.04 | Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation | [pdf]
- [ZeroSeg] | ICCV'23 | Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only | [pdf]
- [CLIPpy] | ICCV'23 | Perceptual Grouping in Contrastive Vision-Language Models | [pdf]
- [MixReorg] | ICCV'23 | MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation | [pdf]
- [CoCu] | NeurIPS'23 | Bridging Semantic Gaps for Language-Supervised Semantic Segmentation | [pdf]|[code]
- [PGSeg] | NeurIPS'23 | Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [SAM-CLIP] | ArXiv'23.10 | SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding | [pdf]
- [CLIP-DINOiser] | ArXiv'23.12 | CLIP-DINOiser: Teaching CLIP a few DINO tricks | [pdf]|[code]
- [TagAlign] | ArXiv'23.12 | TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification | [pdf]|[code]
- [S-Seg] | ArXiv'24.01 | Exploring Simple Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [CLIPSelf] | ICLR'24(Spotlight) | CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction | [pdf]|[code]
- [Uni-OVSeg] | ArXiv'24.02 | Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision | [pdf]|[code]
- [MGCA] | ArXiv'24.03 | Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision | [pdf]
- [TTD] | ArXiv'24.04 | TTD: Text-Tag Self-Distillation Enhancing Image-Text Alignment in CLIP to Alleviate Single Tag Bias | [pdf]|[code]
- [CoDe] | CVPR'24 | Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation | [pdf]
- [LLM-Supervision] | ArXiv'24.03 | Training-Free Semantic Segmentation via LLM-Supervision | [pdf]
- [ProxyCLIP] | ECCV'24 | ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation | [pdf]|[code]
- [LPOSS] | CVPR'25 | LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation | [pdf]|[code]
- [SynSeg] | AAAI'26 | SynSeg: Feature Synergy for Multi-Category Contrastive Learning in Open-Vocabulary Semantic Segmentation | [pdf]
- [RF-CLIP] | AAAI'26 | Target Refocusing via Attention Redistribution for Open-Vocabulary Semantic Segmentation: An Explainability Perspective | [pdf]|[code]
The model is adapted from off-the-shelf large models (e.g., CLIP, diffusion models) without an additional training phase. Note that these large models have themselves already been trained on some datasets (e.g., image-caption datasets).
- [MaskCLIP] | ECCV'22(Oral) | Extract Free Dense Labels from CLIP | [pdf]|[code]
- [ReCo] | NeurIPS'22 | ReCo: Retrieve and Co-segment for Zero-shot Transfer | [pdf]|[code]
- [CLIP Surgery] | ArXiv'23.04 | CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks | [pdf]|[code]
- [OVDiff] | ArXiv'23.06 | Diffusion Models for Zero-Shot Open-Vocabulary Segmentation | [pdf]
- [DiffSegmenter] | ArXiv'23.09 | Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter | [pdf]|[code]
- [IPSeg] | IJCV'24 | Towards Training-free Open-world Segmentation via Image Prompting Foundation Models | [pdf]
- [SCLIP] | ArXiv'23.12 | SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference | [pdf]
- [GEM] | CVPR'24 | Grounding Everything: Emerging Localization Properties in Vision-Language Transformers | [pdf]|[code]
- [CLIP-DIY] | WACV'24 | CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free | [pdf]
- [FOSSIL] | WACV'24 | FOSSIL: Free Open-Vocabulary Semantic Segmentation through Synthetic References Retrieval | [pdf]
- [TagCLIP] | AAAI'24 | TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training | [pdf]|[code]
- [EmerDiff] | ICLR'24 | EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models | [pdf]|[code]
- [FreeSeg-Diff] | ArXiv'24.03 | FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models | [pdf]|[code]
- [MaskDiffusion] | ArXiv'24.03 | MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation | [pdf]|[code]
- [TAG] | ArXiv'24.03 | TAG: Guidance-free Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [Sun et al.] | ArXiv'24.04 | Training-Free Semantic Segmentation via LLM-Supervision | [pdf]
- [NACLIP] | ArXiv'24.04 | Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [PnP-OVSS] | CVPR'24 | Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models | [pdf]|[code]
- [CaR] | CVPR'24 | CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor | [pdf]|[code]
- [Wang et al.] | CVPR'24 | Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [FreeDA] | CVPR'24 | Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation | [pdf]|[code]
- [Yang et al.] | ArXiv'24.05 | Tuning-free Universally-Supervised Semantic Segmentation | [pdf]
- [CLIPTrase] | ECCV'24 | Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation | [pdf]|[code]
- [ClearCLIP] | ECCV'24 | ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference | [pdf]|[code]
- [ProxyCLIP] | ECCV'24 | ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation | [pdf]|[code]
- [LaVG] | ECCV'24 | In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [ITACLIP] | ArXiv'24.11 | ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements | [pdf]|[code]
- [Trident] | ArXiv'24.11 | Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation | [pdf]|[code]
- [CorrCLIP] | ArXiv'24.11 | CorrCLIP: Reconstructing Correlations in CLIP with Off-the-Shelf Foundation Models for Open-Vocabulary Semantic Segmentation | [pdf]
- [CLIPer] | ArXiv'24.11 | CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [ResCLIP] | ArXiv'24.11 | ResCLIP: Residual Attention for Training-free Dense Vision-language Inference | [pdf]|[code]
- [SC-CLIP] | ArXiv'24.11 | Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation | [pdf]|[code]
- [Talk2DINO] | ArXiv'24.11 | Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation | [pdf]|[code]
- [CASS] | CVPR'25 | Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [ReME] | ICCV'25 | ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation | [pdf]|[code]
- [SFP] | ICCV'25 | Feature Purification Matters: Suppressing Outlier Propagation for Training-Free Open-Vocabulary Semantic Segmentation | [pdf]|[code]
- [FSA] | ICCV'25 | Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation | [pdf]|[code]
- [FreeCP] | ICCV'25 | Training-Free Class Purification for Open-Vocabulary Semantic Segmentation | [pdf]|[code]

- [EntitySeg] | ArXiv'23.11 | Rethinking Evaluation Metrics of Open-Vocabulary Segmentation | [pdf]|[code]
- [PixelCLIP] | NeurIPS'24 | Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels | [pdf]|[code]
Unlike open-vocabulary segmentation (cross-dataset), zero-shot methods split the classes of each dataset into seen classes and unseen classes.
- [ZegFormer] | CVPR'22 | ZegFormer: Decoupling Zero-Shot Semantic Segmentation | [pdf]|[code]
- [Xu et al.] | ECCV'22 | A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model | [pdf]|[code]
- [ZegCLIP] | CVPR'23 | ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation | [pdf]|[code]
- [PADing] | CVPR'23 | Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation | [pdf]|[code]
- [DeOP] | ICCV'23 | Open Vocabulary Semantic Segmentation with Decoupled One-Pass Network | [pdf]|[code]
- [SPT] | AAAI'24 | Spectral Prompt Tuning: Unveiling Unseen Classes for Zero-Shot Semantic Segmentation | [pdf]|[code]
- [Chen et al.] | ArXiv'24.02 | Generalizable Semantic Vision Query Generation for Zero-shot Panoptic and Semantic Segmentation | [pdf]
- [LDVC] | ArXiv'24.03 | Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation | [pdf]
- [OTSeg] | ArXiv'24.03 | OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation | [pdf]
- [Cascade-CLIP] | ICML'24 | Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation | [pdf]|[code]
- [SimZSS] | ArXiv'24.07 | A Simple Framework for Open-Vocabulary Zero-Shot Segmentation | [pdf]
- [CaR] | CVPR'24 | CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor | [pdf]|[code]

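
The seen/unseen split used by the zero-shot setting can be sketched in a few lines of Python. This is only an illustrative sketch, not the protocol of any paper listed here: the helper names `split_classes` and `mask_unseen`, the toy class list, and the ignore value `255` are all assumptions. It also assumes that pixels of unseen classes are ignored in the training labels, which individual methods may handle differently.

```python
IGNORE_INDEX = 255  # assumed "ignore" label value; datasets may use another


def split_classes(class_names, unseen_names):
    """Partition one dataset's classes into seen (trainable) and unseen (test-only)."""
    unseen_set = set(unseen_names)
    seen = [c for c in class_names if c not in unseen_set]
    unseen = [c for c in class_names if c in unseen_set]
    return seen, unseen


def mask_unseen(label_map, class_names, unseen_names):
    """Replace unseen-class pixels in a training label map with IGNORE_INDEX."""
    unseen_ids = {class_names.index(c) for c in unseen_names}
    return [[IGNORE_INDEX if v in unseen_ids else v for v in row]
            for row in label_map]


# Hypothetical toy dataset: class ids follow list order.
classes = ["person", "car", "dog", "cat"]
seen, unseen = split_classes(classes, ["dog", "cat"])
train_labels = mask_unseen([[0, 2], [3, 1]], classes, unseen)
```

In this sketch the unseen classes come from the same dataset as the seen ones, whereas the open-vocabulary (cross-dataset) setting trains on one dataset and evaluates on the class sets of others.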
- [CARIS] | ACM MM'23 | CARIS: Context-Aware Referring Image Segmentation | [pdf]|[code]
- [BKINet] | TMM'23 | Bilateral Knowledge Interaction Network for Referring Image Segmentation | [pdf]|[code]
- [Group-RES] | ICCV'23 | Advancing Referring Expression Segmentation Beyond Single Image | [pdf]|[code]
- [RIS-DMMI] | ICCV'23 | Beyond One-to-One: Rethinking the Referring Image Segmentation | [pdf]|[code]
- [ETRIS] | ICCV'23 | Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation | [pdf]|[code]
- [SEEM] | ArXiv'23.04 | Segment Everything Everywhere All at Once | [pdf]|[code]

- [Strudel et al.] | ArXiv'22.05 | Weakly-supervised segmentation of referring expressions | [pdf]
- [Kim et al.] | ICCV'23 | Shatter and Gather: Learning Referring Image Segmentation with Text Supervision | [pdf]|[code]
- [TRIS] | ICCV'23 | Referring Image Segmentation Using Text Supervision | [pdf]|[code]
- [Jungbeom Lee et al.] | ICCV'23 | Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency | [pdf]
- [PPT] | CVPR'24 | Curriculum Point Prompting for Weakly-Supervised Referring Segmentation | [pdf]

- [RO-ViT] | CVPR'23(Highlight) | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | [pdf]|[code]
- [CAT] | CVPR'23 | CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection | [pdf]|[code]
- [DetCLIPv2] | CVPR'23 | DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment | [pdf]
- [CondHead] | CVPR'23 | Learning to Detect and Segment for Open Vocabulary Object Detection | [pdf]
- [CORA] | CVPR'23 | CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching | [pdf]|[code]
- [ovdet] | CVPR'23 | Aligning Bag of Regions for Open-Vocabulary Object Detection | [pdf]|[code]
- [OADP] | CVPR'23 | Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection | [pdf]|[code]
- [F-VLM] | ICLR'23 | F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models | [pdf]|[code]
- [mm-ovod] | ICML'23 | Multi-Modal Classifiers for Open-Vocabulary Object Detection | [pdf]|[code]
- [SGDN] | ArXiv'23.07 | Open-Vocabulary Object Detection via Scene Graph Discovery | [pdf]
- [MMC-Det] | ArXiv'23.08 | Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection | [pdf]
- [SAS-Det] | CVPR'24 | Taming Self-Training for Open-Vocabulary Object Detection | [pdf]|[code]
- [DITO] | ArXiv'23.09 | Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection | [pdf]|[code]
- [EdaDet] | ICCV'23 | EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment | [pdf]|[code]
- [LP-OVOD] | WACV'24 | LP-OVOD: Open-Vocabulary Object Detection by Linear Probing | [pdf]|[code]
- [DST-Det] | ArXiv'23.10 | DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection | [pdf]|[code]
- [CoDet] | NeurIPS'23 | CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection | [pdf]|[code]
- [PLAC] | ArXiv'23.12 | Learning Pseudo-Labeler beyond Noun Concepts for Open-Vocabulary Object Detection | [pdf]
- [Sambor] | ArXiv'23.12 | Boosting Segment Anything Model Towards Open-Vocabulary Learning | [pdf]|[code]
- [DVDet] | ICLR'24 | LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors | [pdf]
- [DetCLIPv3] | CVPR'24 | DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection | [pdf]
- [AggDet] | ArXiv'24.04 | Training-free Boost for Open-Vocabulary Object Detection with Confidence Aggregation | [pdf]
- [RALF] | CVPR'24 | Retrieval-Augmented Open-Vocabulary Object Detection | [pdf]|[code]
- [Chhipa et al.] | ArXiv'24.06 | Investigating Robustness of Open-Vocabulary Foundation Object Detectors under Distribution Shifts | [pdf]
- [SHiNe] | CVPR'24(Highlight) | SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection | [pdf]|[code]
- [RTGen] | ArXiv'24.06 | RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection | [pdf]|[code]
- [LBP] | CVPR'24 | Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection | [pdf]
- [YOLO-World] | CVPR'24 | Real-Time Open-Vocabulary Object Detection | [pdf]|[code]
- [OV-DINO] | ArXiv'24.07 | Unified Open-Vocabulary Detection with Language-Aware Selective Fusion | [pdf]|[code]
- [OVLW-DETR] | ArXiv'24.07 | OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer | [pdf]|[code]
- [LaMI-DETR] | ECCV'24 | LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction | [pdf]|[code]
- [MarvelOVD] | ECCV'24 | MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection | [pdf]|[code]
- [DetLH] | NeurIPS'24 | Open-Vocabulary Object Detection via Language Hierarchy | [pdf]
- [CCKT-Det] | ICLR'25 | Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection | [pdf]
- [HD-OVD] | TMM'25 | A Hierarchical Semantic Distillation Framework for Open-Vocabulary Object Detection | [pdf]|[code]
- [LLMDet] | CVPR'25(Highlight) | LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models | [pdf]|[code]
- [VMCNet] | ArXiv'25.03 | Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection | [pdf]
- [Vireo] | ArXiv'25.06 | Vireo: Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation | [pdf]
- [ATAS] | ArXiv'25.06 | ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction | [pdf]
- [SSEP] | ArXiv'25.11 | State and Scene Enhanced Prototypes for Weakly Supervised Open-Vocabulary Object Detection | [pdf]

- [Semantic-SAM] | ECCV'24 | Semantic-SAM: Segment and Recognize Anything at Any Granularity | [pdf]|[code]
- [Open-Vocabulary SAM] | ECCV'24 | Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively | [pdf]|[code]
- [OMG-Seg] | CVPR'24 | OMG-Seg: Is One Model Good Enough For All Segmentation? | [pdf]|[code]
- [OMG-LLaVA] | NeurIPS'24 | OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | [pdf]|[code]
- [PSALM] | ECCV'24 | PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model | [pdf]|[code]
- [HyperSeg] | ArXiv'24.11 | HyperSeg: Towards Universal Visual Segmentation with Large Language Model | [pdf]|[code]
- [SAMRefiner] | ICLR'25 | SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement | [pdf]|[code]

- [DENOISER] | ArXiv'24.04 | DENOISER: Rethinking the Robustness for Open-Vocabulary Action Recognition | [pdf]
- [O2V-mapping] | ArXiv'24.04 | O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation | [pdf]
- [CMD-SE] | CVPR'24 | Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection | [pdf]
- [FG-CLIP] | CBMI'24 | Is CLIP the main roadblock for fine-grained open-world perception? | [pdf]|[code]
- [NegPrompt] | CVPR'24 | Learning Transferable Negative Prompts for Out-of-Distribution Detection | [pdf]|[code]
- [OVFoodSeg] | CVPR'24 | OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation | [pdf]
- [Fed-MP] | NAACL'24 | Open-Vocabulary Federated Learning with Multimodal Prototyping | [pdf]
- [PSALM] | ECCV'24 | PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model | [pdf]|[code]
- [OVAM] | ArXiv'24.03 | Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models | [pdf]
- [CLIP-VIS] | ArXiv'24.06 | CLIP-VIS: Adapting CLIP for Open-Vocabulary Video Instance Segmentation | [pdf]
- [RoboHop] | ICRA'24 | RoboHop: Segment-based Topological Map Representation for Open-World Visual Navigation | [pdf]
- [Rein] | CVPR'24 | Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation | [pdf]|[code]
- [OVMR] | CVPR'24 | OVMR: Open-Vocabulary Recognition with Multi-Modal References | [pdf]|[code]
- [PartCLIPSeg] | ArXiv'24.06 | Understanding Multi-Granularity for Open-Vocabulary Part Segmentation | [pdf]|[code]
- [GBC] | ArXiv'24.07 | Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions | [pdf]
- [TCC] | ArXiv'24.07 | A Study of Test-time Contrastive Concepts for Open-world, Open-vocabulary Semantic Segmentation | [pdf]
- [OPS] | ECCV'24 | Open Panoramic Segmentation | [pdf]|[code]
- [Yu et al.] | ArXiv'24.07 | PanopticRecon: Leverage Open-vocabulary Instance Segmentation for Zero-shot Panoptic Reconstruction | [pdf]
- [Oryon] | CVPR'24(Highlight) | Oryon: Open-Vocabulary Object 6D Pose Estimation | [pdf]|[code]
- [GLIS] | ECCV'24 | Global-Local Collaborative Inference with LLM for Lidar-Based Open-Vocabulary Detection | [pdf]|[code]
- [OVExp] | ArXiv'24.07 | OVExp: Open Vocabulary Exploration for Object-Oriented Navigation | [pdf]|[code]
- [OV-MLVC] | ArXiv'24.07 | Open Vocabulary Multi-Label Video Classification | [pdf]
- [DART] | ArXiv'24.07 | An automated end-to-end object detection pipeline with data Diversification, open-vocabulary bounding box Annotation, pseudo-label Review, and model Training | [pdf]|[code]
- [NOVIC] | ArXiv'24.07 | Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion | [pdf]
- [CerberusDet] | ArXiv'24.07 | CerberusDet: Unified Multi-Task Object Detection | [pdf]
- [GGSD] | ArXiv'24.07 | Open Vocabulary 3D Scene Understanding via Geometry Guided Self-Distillation | [pdf]|[code]
- [Diff2Scene] | ECCV'24 | Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models | [pdf]
- [SegPoint] | ECCV'24 | SegPoint: Segment Any Point Cloud via Large Language Model | [pdf]|[code]
- [LangOcc] | ArXiv'24.07 | LangOcc: Self-Supervised Open Vocabulary Occupancy Estimation via Volume Rendering | [pdf]
- [OVR] | ArXiv'24.07 | A Dataset for Open Vocabulary Temporal Repetition Counting in Videos | [pdf]|[code]
- [SAM-CP] | ArXiv'24.07 | SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation | [pdf]|[code]
- [OV-AVSS] | ACM MM'24(Oral) | Open-Vocabulary Audio-Visual Semantic Segmentation | [pdf]|[code]
- [Open3DRF] | ArXiv'24.08 | Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space | [pdf]|[code]
- [OVA-DETR] | ArXiv'24.08 | OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion | [pdf]|[code]
- [OVAL] | ArXiv'24.08 | Open-vocabulary Temporal Action Localization using VLMs | [pdf]|[code]
- [EMPOWER] | IROS'24 | EMPOWER: Embodied Multi-role Open-vocabulary Planning with Online Grounding and Execution | [pdf]|[code]
- [AnytimeCL] | ECCV'24(Oral) | Anytime Continual Learning for Open Vocabulary Classification | [pdf]|[code]
- [OWL] | IJCV'24 | Lidar Panoptic Segmentation in an Open World | [pdf]|[code]
- [DWI] | ArXiv'24.10 | Overcoming Domain Limitations in Open-vocabulary Segmentation | [pdf]|[code]
- [OVT-B-Dataset] | NeurIPS'24 | OVT-B: A New Large-Scale Benchmark for Open-Vocabulary Multi-Object Tracking | [pdf]|[code]
- [OpenMixer] | WACV'25 | Exploiting VLM Localizability and Semantics for Open Vocabulary Action | [pdf]|[code]
- [Octree-Graph] | ArXiv'24.11 | Open-Vocabulary Octree-Graph for 3D Scene Understanding Segmentation | [pdf]
- [Fun3DU] | ArXiv'24.11 | Functionality understanding and segmentation in 3D scenes | [pdf]
- [MASA] | CVPR'24(Highlight) | Matching Anything by Segmenting Anything | [pdf]|[code]
- [OVOW] | ArXiv'24.11 | From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects | [pdf]|[code]
- [DINO-X] | ArXiv'24.11 | DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding | [pdf]|[code]
- [CellSeg1] | ArXiv'24.12 | CellSeg1: Robust Cell Segmentation with One Training Image | [pdf]
- [DB-SAM] | MICCAI'24(Oral) | DB-SAM: Delving into High Quality Universal Medical Image Segmentation | [pdf]|[code]
- [Seg-TTO] | ArXiv'25.03 | Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation | [pdf]|[code]
- [Chicken-and-egg] | ArXiv'25.02 | From Open-Vocabulary to Vocabulary-Free Semantic Segmentation | [pdf]
- [Open-MeDe] | ArXiv'25.02 | Learning to Generalize without Bias for Open-Vocabulary Action Recognition | [pdf]
- [Kang et al.] | ArXiv'25.03 | Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | [pdf]
- [TRACT] | ArXiv'25.03 | Attention to Trajectory: Trajectory-Aware Open-Vocabulary Tracking | [pdf]
- [GSNet] | AAAI'25 | Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation | [pdf]
- [ComCa] | CVPR'25 | Compositional Caching for Training-free Open-vocabulary Attribute Detection | [pdf]|[code]
- [BOLqGLKOLuJGI. ] | ArXiv'25.03 | Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments | [pdf]
- [PRISM-0] | ArXiv'25.04 | PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks | [pdf]
- [AerOSeg] | CVPR'25 EarthVision workshop | AerOSeg: Harnessing SAM for Open-Vocabulary Segmentation in Remote Sensing Images | [pdf]
- [NVSMask3D] | SCIA'25 | NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation | [pdf]
- [RESAnything] | ArXiv'25.05 | RESAnything: Attribute Prompting for Arbitrary Referring Segmentation | [pdf]|[code]

- Towards Open Vocabulary Learning: A Survey | [pdf]
- A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future | [pdf]
- Image Segmentation in Foundation Model Era: A Survey | [pdf]
If you have any suggestions or find missing papers, please don't hesitate to contact me via [email protected] or [email protected].