[CVPR 2024] VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning & VSCode-v2: Dynamic Prompt Learning for General Visual Salient and Camouflaged Object Detection with Two-Stage Optimization
Ziyang Luo, Nian Liu, Wangbo Zhao, Xuguang Yang, Dingwen Zhang, Deng-Ping Fan, Fahad Khan, Junwei Han
Approach: [arxiv Paper]
The extension of VSCode has been accepted by TPAMI. The code, models, and results can be found in this repository; the final, camera-ready TPAMI version of the paper will be provided soon. The VSCode-v2 implementation is located in the VSCode2 folder, which contains all training, testing, and evaluation code.
[VSCode] We introduce VSCode, a generalist model with novel 2D prompt learning, to jointly address four SOD tasks and three COD tasks. We utilize VST as the foundation model and introduce 2D prompts within the encoder-decoder architecture to learn domain and task-specific knowledge on two separate dimensions. A prompt discrimination loss helps disentangle peculiarities to benefit model optimization. VSCode outperforms state-of-the-art methods across six tasks on 26 datasets and exhibits zero-shot generalization to unseen tasks by combining 2D prompts, such as RGB-D COD.
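To make the 2D prompt idea concrete, here is a minimal conceptual sketch of how a domain prompt and a task prompt could be composed on two separate dimensions and prepended to the backbone tokens. The class name, prompt length, and insertion point are illustrative assumptions and do not reproduce the released VSCode code.

```python
# Conceptual sketch of 2D prompt learning (illustrative, not the released code).
# Assumption: each domain (RGB, RGB-D, RGB-T, Video, ...) and each task (SOD, COD)
# owns a small set of learnable prompt tokens; a sample uses the concatenation of
# its domain prompts and task prompts, prepended to the transformer tokens.
import torch
import torch.nn as nn


class Prompt2D(nn.Module):
    def __init__(self, num_domains, num_tasks, prompt_len=4, dim=384):
        super().__init__()
        # One row of prompt tokens per domain and per task (the two "dimensions").
        self.domain_prompts = nn.Parameter(torch.randn(num_domains, prompt_len, dim) * 0.02)
        self.task_prompts = nn.Parameter(torch.randn(num_tasks, prompt_len, dim) * 0.02)

    def forward(self, tokens, domain_id, task_id):
        # tokens: (B, N, dim) patch tokens from the transformer backbone.
        b = tokens.size(0)
        d = self.domain_prompts[domain_id].unsqueeze(0).expand(b, -1, -1)
        t = self.task_prompts[task_id].unsqueeze(0).expand(b, -1, -1)
        # Prepend domain and task prompts so the backbone attends to them jointly.
        return torch.cat([d, t, tokens], dim=1)


# Zero-shot combination for an unseen domain-task pair (e.g., RGB-D COD) simply
# indexes the RGB-D domain prompts together with the COD task prompts.
```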
[VSCode-v2] Salient object detection (SOD) and camouflaged object detection (COD) are related but distinct binary mapping tasks, each involving multiple modalities that share commonalities while maintaining unique characteristics. Existing approaches often rely on complex, task-specific architectures, leading to redundancy and limited generalization. Our previous work, VSCode, introduced a generalist model that effectively handles four SOD tasks and two COD tasks. VSCode leveraged VST as its foundation model and incorporated 2D prompts within an encoder-decoder framework to capture domain and task-specific knowledge, utilizing a prompt discrimination loss to optimize the model. Building upon the proven effectiveness of our previous work VSCode, we identify opportunities to further strengthen generalization capabilities through focused modifications in model design and optimization strategy. To unlock this potential, we propose VSCode-v2, an extension that introduces a Mixture of Prompt Experts (MoPE) layer to generate adaptive prompts. We also redesign the training process into a two-stage approach: first learning shared features across tasks, then capturing specific characteristics. To preserve knowledge during this process, we incorporate distillation from our conference version model. Furthermore, we propose a contrastive learning mechanism with data augmentation to strengthen the relationships between prompts and feature representations. VSCode-v2 demonstrates balanced performance improvements across six SOD and COD tasks. Moreover, VSCode-v2 effectively handles various multimodal inputs and exhibits zero-shot generalization capability to novel tasks, such as RGB-D Video SOD.
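Similarly, the sketch below shows one way a Mixture of Prompt Experts (MoPE) layer could generate adaptive prompts by routing over several learnable prompt experts. The number of experts and the routing signal (a mean-pooled token) are assumptions for illustration only, not the VSCode-v2 implementation.

```python
# Illustrative sketch of a Mixture of Prompt Experts (MoPE) layer: a lightweight
# router weights several prompt "experts" conditioned on the input features,
# producing an adaptive prompt instead of a fixed one. Expert count and routing
# input are assumptions, not the released implementation.
import torch
import torch.nn as nn


class MoPE(nn.Module):
    def __init__(self, num_experts=4, prompt_len=4, dim=384):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(num_experts, prompt_len, dim) * 0.02)
        self.router = nn.Linear(dim, num_experts)

    def forward(self, tokens):
        # tokens: (B, N, dim); route on the mean token as a cheap global descriptor.
        gate = torch.softmax(self.router(tokens.mean(dim=1)), dim=-1)  # (B, E)
        prompt = torch.einsum('be,eld->bld', gate, self.experts)       # (B, L, dim)
        return torch.cat([prompt, tokens], dim=1)
```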
PyTorch
For RGB SOD and RGB-D SOD, we train our model concurrently on the following datasets: the training set of DUTS for RGB SOD, and the training sets of NJUD, NLPR, and DUTLF-Depth for RGB-D SOD.
For testing the RGB SOD task, we use DUTS, ECSSD, HKU-IS, PASCAL-S, DUT-O, and SOD, while STERE, NJUD, NLPR, DUTLF-Depth, SIP, and ReDWeb-S datasets are employed for testing the RGB-D SOD task. You can directly download these datasets by following [VST].
We employ the training set of VT5000 to train our model, and VT821, VT1000, and the testing set of VT5000 are used for testing (from link). Please download the corresponding contour maps from [baidu,PIN:m9ht] for VT5000 and place them into the RGBT folder.
For VSOD, we employ six widely used benchmark datasets: DAVIS, FBMS, ViSal, SegV2, DAVSOD-Easy, and DAVSOD-Normal (from link). Please download the corresponding contour maps and optical flow from [baidu,PIN:jyzy] and [baidu,PIN:bxi7](https://pan.baidu.com/s/1IUPH8jG-t2ZlK1Acw1W1oA) for DAVIS and DAVSOD, and put them into the Video folder. For the VSOD and VCOD tasks, we follow the common practice of using FlowNet 2.0 as the optical flow extractor due to its consistently strong performance.
For RGB COD, three benchmark datasets are considered: COD10K, CAMO, and NC4K. Please download the corresponding contour maps from [baidu,PIN:gkq2] and [baidu,PIN:zojp] for COD10K and CAMO, and put them into the COD/rgb/ folder.
For VCOD, we use two widely accepted benchmark datasets: CAD and MoCA-Mask (from link). Please download the corresponding contour maps and optical flow from [baidu,PIN:tjah] for MoCA-Mask, and put them into the COD/rgbv/ folder.
The overall dataset folder should be organized like this:
-- Data
| -- RGB
| | -- DUTS
| | -- ECSSD
...
| -- RGBD
| | -- NJUD
| | -- NLPR
...
| -- RGBT
| | -- VT821
| | -- | RGB
| | -- | GT
| | -- | T
| | -- VT5000
| | | -- Train
| | | -- | RGB
| | | -- | GT
| | | -- | T
| | | -- | Contour
| | | -- Test
...
| -- Video
| | -- Train
| | | -- DAVSOD
| | | | -- select_0043
| | | | -- | RGB
| | | | -- | GT
| | | | -- | Flow
| | | | -- | Contour
| | -- Test
| | | -- DAVIS16
| | | | -- blackswan
| | | | -- | Frame
| | | | -- | GT
| | | | -- | OF_FlowNet2
...
| -- COD
| | -- rgb
| | | -- Train
| | | | -- CAMO
| | | | -- | RGB
| | | | -- | GT
| | | | -- | Contour
| | | -- Test
| | | | -- CAMO
| | | | -- | RGB
| | | | -- | GT
...
| | -- rgbv
| | | -- Train
| | | | -- MoCA_Mask
| | | | | -- TrainDataset_per_sq
| | | | | | -- crab
| | | | | | -- | Imgs
| | | | | | -- | GT
| | | | | | -- | Flow
| | | | | | -- | Contour
| | | -- Test
| | | | -- MoCA_Mask
| | | | | | -- arctic_fox
| | | | | | -- | Imgs
| | | | | | -- | GT
| | | | | | -- | Flow
...
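As an optional sanity check before training, a small helper like the one below (our own, not part of the repository) can verify that the Data folder matches the layout above; it only checks a representative subset of the folders listed in the tree.

```python
# Verify that the Data folder roughly matches the expected layout.
import os

EXPECTED = {
    'RGB': ['DUTS', 'ECSSD'],
    'RGBD': ['NJUD', 'NLPR'],
    'RGBT': ['VT821', 'VT5000'],
    'Video': ['Train', 'Test'],
    'COD': ['rgb', 'rgbv'],
}

def check_data_root(root='Data'):
    for sub, children in EXPECTED.items():
        for child in children:
            path = os.path.join(root, sub, child)
            if not os.path.isdir(path):
                print(f'Missing: {path}')

if __name__ == '__main__':
    check_data_root()
```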
For VSCode-v2, we introduce concatenated augmentation data for the contrastive loss. We recommend that readers generate their own datasets as well. For reference, we list the datasets we generated, including RGB_pseudo [baidu,PIN:j7q5], RGBD_pseudo [baidu,PIN:3k1p], RGBT_pseudo [baidu,PIN:kj8b], RGBV_pseudo [baidu,PIN:kprh], CODRGB_pseudo [baidu,PIN:2if2] and CODRGBV_pseudo [baidu,PIN:di5c].
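The released pseudo datasets define the exact augmentation; purely as a rough illustration, one plausible way to build a concatenated pseudo sample is to resize two source images and stitch them, together with their ground-truth maps, side by side:

```python
# Hypothetical helper for building a concatenated pseudo sample; the actual
# recipe used for the released *_pseudo sets may differ.
from PIL import Image

def concat_pair(img_a, gt_a, img_b, gt_b, size=(224, 224)):
    img_a, img_b = img_a.resize(size), img_b.resize(size)
    gt_a, gt_b = gt_a.resize(size), gt_b.resize(size)
    w, h = size
    img = Image.new('RGB', (2 * w, h))
    gt = Image.new('L', (2 * w, h))
    img.paste(img_a, (0, 0)); img.paste(img_b, (w, 0))
    gt.paste(gt_a.convert('L'), (0, 0)); gt.paste(gt_b.convert('L'), (w, 0))
    return img, gt
```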
Run `python train_test_eval.py --Training True --Testing True --Evaluation True` for training, testing, and evaluation, similar to VST.
Please be aware that our evaluation tool may produce slightly different results from Zhao Zhang's VSOD evaluation toolbox, as certain ground truth maps may not be binarized.
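If you want scores that match evaluators assuming binary ground truth, you can binarize the GT maps beforehand; the threshold of 128 below is our own assumption.

```python
# Binarize a ground-truth map before evaluation (threshold is an assumption).
import numpy as np
from PIL import Image

def binarize_gt(in_path, out_path, thresh=128):
    gt = np.array(Image.open(in_path).convert('L'))
    Image.fromarray(((gt >= thresh) * 255).astype(np.uint8)).save(out_path)
```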
Due to the limited storage capacity of my Google Drive, I am unable to upload additional files there. If you can only access the data via Google Drive and cannot use Baidu Cloud, please contact me by email ([email protected]).
| Name | Backbone | Params (M) | Weight |
|---|---|---|---|
| VSCode-T | Swin-T | 54.09 | [baidu,PIN:mmn1]/[Google Drive] |
| VSCode-S | Swin-S | 74.72 | [baidu,PIN:8jig]/[Google Drive] |
| VSCode-B | Swin-B | 117.41 | [baidu,PIN:kidl]/[Google Drive] |
| VSCode-v2-T(stage1) | Swin-T | - | [baidu,PIN:wexs] |
| VSCode-v2-S(stage1) | Swin-S | - | [baidu,PIN:gnma] |
| VSCode-v2-T(stage2) | Swin-T | 69.8 | [baidu,PIN:8imx] |
| VSCode-v2-S(stage2) | Swin-S | 90.4 | [baidu,PIN:4r7b] |
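After downloading a checkpoint, it can be loaded along the lines of the sketch below; the file name and the 'model' dict key are assumptions and may need adjusting to the released files.

```python
# Minimal sketch for loading a downloaded checkpoint into a built model.
import torch

def load_weights(model, ckpt_path='VSCode-T.pth'):
    state = torch.load(ckpt_path, map_location='cpu')
    if isinstance(state, dict) and 'model' in state:  # unwrap if wrapped
        state = state['model']
    missing, unexpected = model.load_state_dict(state, strict=False)
    print('missing keys:', missing, 'unexpected keys:', unexpected)
    return model
```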
We offer the prediction maps of VSCode-T [baidu,PIN:gsvf]/[Google Drive], VSCode-S [baidu,PIN:ohf5]/[Google Drive], VSCode-B [baidu,PIN:uldc]/[Google Drive], VSCode-v2-T [baidu,PIN:x787], and VSCode-v2-S [baidu,PIN:v2i6] at this time.
If you use VSCode or VSCode-v2 in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.
@inproceedings{luo2024vscode,
title={Vscode: General visual salient and camouflaged object detection with 2d prompt learning},
author={Luo, Ziyang and Liu, Nian and Zhao, Wangbo and Yang, Xuguang and Zhang, Dingwen and Fan, Deng-Ping and Khan, Fahad and Han, Junwei},
booktitle={Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
pages={17169--17180},
year={2024}
}



