[Caution: This repository is still under development and not cleanly documented yet. We recommend using it only as a reference.]
In this repository, we build our Global-Local-Transformer model on top of a selection of base scene graph generator models, including KERN, Neural Motifs, and Stanford (iterative message passing), to improve scene graph generation by leveraging visual commonsense.
The corresponding paper was accepted at ECCV 2020 (arXiv preprint arXiv:2006.09623): Alireza Zareian*, Zhecan Wang*, Haoxuan You*, Shih-Fu Chang, "Learning Visual Commonsense for Robust Scene Graph Generation", ECCV, 2020. (* co-first authors) [manuscript]
For pretraining and independent finetuning of GLAT, please refer to another repository: https://github.com/ZhecanJamesWang/GLAT_Visual_Commonsense
Tianshui Chen*, Weihao Yu*, Riquan Chen, and Liang Lin, “Knowledge-Embedded Routing Network for Scene Graph Generation”, CVPR, 2019. (* co-first authors) [manuscript]
Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi, "Neural Motifs: Scene Graph Parsing with Global Context", CVPR, 2018.
Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei, "Scene Graph Generation by Iterative Message Passing", CVPR, 2017.
We evaluate with Recall@X (R@X) and mean Recall@X (mR@X). In the validation/test dataset, assume there are $Y$ images. For each image, a model generates the top $X$ predicted relationship triplets. For image $I_y$, there are $G_y$ ground truth relationship triplets, of which $T_y$ are predicted successfully by the model. We can calculate:

$$R@X = \frac{1}{Y} \sum_{y=1}^{Y} \frac{T_y}{G_y}$$

For image $I_y$, among its $G_y$ ground truth relationship triplets, there are $G_{y,k}$ ground truth triplets with relationship $k$ (except $k=1$, meaning no relationship; the number of relationship classes is $K$, including no relationship), of which $T_{y,k}$ are predicted successfully by the model. Among the $Y$ images of the validation/test dataset, for relationship $k$, there are $Y_k$ images which contain at least one ground truth triplet with this relationship. The R@X of relationship $k$ can be calculated:

$$R@X_k = \frac{1}{Y_k} \sum_{y:\, G_{y,k} > 0} \frac{T_{y,k}}{G_{y,k}}$$

Then we can calculate:

$$mR@X = \frac{1}{K-1} \sum_{k=2}^{K} R@X_k$$
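
Below is a minimal sketch of how these metrics can be computed, assuming each image's ground truth and predictions are available as (subject, relationship, object) triplets. The function name `recall_at_x` and the data layout are hypothetical and do not correspond to the actual evaluation code in this repository:

```python
from collections import defaultdict
import numpy as np


def recall_at_x(gt_triplets_per_image, pred_triplets_per_image, x=50):
    """Compute R@X and mR@X (hypothetical sketch, not this repo's evaluator).

    gt_triplets_per_image:   list (one entry per image) of sets of
                             (subject, relationship, object) ground truth triplets
    pred_triplets_per_image: list (one entry per image) of ranked lists of
                             predicted (subject, relationship, object) triplets
    """
    per_image_recall = []               # T_y / G_y for every image
    per_rel_recall = defaultdict(list)  # k -> [T_{y,k} / G_{y,k} for images containing k]

    for gt, pred in zip(gt_triplets_per_image, pred_triplets_per_image):
        top_x = set(pred[:x])           # keep only the top-X predictions
        if gt:
            hits = gt & top_x           # T_y: ground truth triplets recovered
            per_image_recall.append(len(hits) / len(gt))

        # Group ground truth triplets by their relationship class k
        gt_by_rel = defaultdict(set)
        for triplet in gt:
            gt_by_rel[triplet[1]].add(triplet)
        for k, gt_k in gt_by_rel.items():
            hits_k = gt_k & top_x       # T_{y,k}
            per_rel_recall[k].append(len(hits_k) / len(gt_k))

    r_at_x = float(np.mean(per_image_recall))
    # R@X_k is averaged over the Y_k images containing relationship k;
    # mR@X then averages R@X_k over the relationship classes that occur in the
    # ground truth (the "no relationship" class never appears there).
    r_at_x_per_rel = {k: float(np.mean(v)) for k, v in per_rel_recall.items()}
    m_r_at_x = float(np.mean(list(r_at_x_per_rel.values())))
    return r_at_x, m_r_at_x
```

In practice, a predicted triplet is usually counted as a hit only if its subject and object boxes also match the ground truth boxes above an IoU threshold; that box-matching step is omitted here for brevity.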