Weakly Supervised Multimodal Affordance Grounding for Egocentric Images (AAAI 2024)
Appendix
Link: Appendix.pdf
Abstract:
To enhance the interaction between intelligent systems and the environment, locating the affordance regions of objects is crucial. These regions correspond to specific areas that provide distinct functionalities. Humans often acquire the ability to identify these regions through action demonstrations and verbal instructions. In this paper, we present a novel multimodal framework that extracts affordance knowledge from exocentric images, which depict human-object interactions, as well as from accompanying textual descriptions that describe the performed actions. The extracted knowledge is then transferred to egocentric images. To achieve this goal, we propose the HOI-Transfer Module, which utilizes local perception to disentangle individual actions within exocentric images. This module effectively captures localized features and correlations between actions, leading to valuable affordance knowledge. Additionally, we introduce the Pixel-Text Fusion Module, which fuses affordance knowledge by identifying regions in egocentric images that bear resemblances to the textual features defining affordances. We employ a Weakly Supervised Multimodal Affordance (WSMA) learning approach, utilizing image-level labels for training. Through extensive experiments, we demonstrate the superiority of our proposed method in terms of evaluation metrics and visual results when compared to existing affordance grounding models. Furthermore, ablation experiments confirm the effectiveness of our approach.
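To make the fusion idea from the abstract concrete, below is a minimal PyTorch sketch of pixel-text similarity: each pixel embedding of an egocentric feature map is scored against an affordance text embedding to highlight candidate regions. This is an illustration of the general mechanism, not the paper's actual Pixel-Text Fusion Module; the function name, tensor shapes, and feature dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_text_similarity(pixel_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """Illustrative pixel-text fusion: score each pixel of an egocentric
    feature map against an affordance text embedding.

    pixel_feats: (B, C, H, W) egocentric image features (e.g., from a ViT backbone)
    text_feat:   (B, C) affordance text embedding (e.g., from a CLIP text encoder)
    returns:     (B, H, W) similarity map highlighting candidate affordance regions
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)  # unit-normalize along channels
    text_feat = F.normalize(text_feat, dim=1)      # unit-normalize the text embedding
    # Cosine similarity between every pixel embedding and the text embedding.
    return torch.einsum("bchw,bc->bhw", pixel_feats, text_feat)

# Example usage with random tensors (shapes are assumptions).
pixels = torch.randn(2, 512, 14, 14)
text = torch.randn(2, 512)
heatmap = pixel_text_similarity(pixels, text)
print(heatmap.shape)  # torch.Size([2, 14, 14])
```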
We run the code in the following environment:
- An NVIDIA GeForce RTX 3090
- Python 3.8
- PyTorch 1.10.0
- Model for DINO ViT (no separate download needed; it is already included in the code)
- Model for the text encoder (CLIP): You can find it here (see the loading sketch below)
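As a quick sanity check of this environment, the sketch below verifies the PyTorch/CUDA setup and loads a CLIP model to obtain its text encoder via the official `clip` package. The "ViT-B/32" variant and the example phrase are assumptions; check the repository for the variant actually used.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

# Sanity-check the environment listed above.
print("PyTorch:", torch.__version__)            # expected 1.10.0
print("CUDA available:", torch.cuda.is_available())

# Load a CLIP model to obtain the text encoder; "ViT-B/32" is an assumption.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Encode an example affordance phrase with the CLIP text encoder.
tokens = clip.tokenize(["a photo of a person cutting with a knife"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(tokens)       # (1, 512) for ViT-B/32
print(text_feat.shape)
```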
git clone https://github.com/xulingjing88/WSMA.git
cd WSMA

Before training, you need to preprocess the data:
python preprocessing.py

Set 'data_root' to the path of the dataset and 'divide' to the dataset name (Seen, Unseen, or HICO-IIF), for example as in the sketch below.
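For illustration, the two settings above might look roughly as follows. The variable names mirror the README, but whether preprocessing.py reads them as in-file constants or as script arguments should be checked against the script itself, and the path is a placeholder.

```python
# Hypothetical illustration of the two settings mentioned above; check
# preprocessing.py for how they are actually supplied.
data_root = "/path/to/dataset"   # root directory of the downloaded dataset (placeholder path)
divide = "Seen"                  # dataset name: "Seen", "Unseen", or "HICO-IIF"
```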
You can then start training by running train.py:

python train.py

We would like to express our gratitude to the following repositories for their contributions and inspiration: Cross-View-AG, LOCATE, Dino, CLIP.
