The official repository for “Image Captioning via Dynamic Path Customization”.
Dynamic Transformer Network (DTNet) is a model that generates discriminative yet accurate captions by dynamically assigning customized paths to different samples.
The framework of the proposed Dynamic Transformer Network (DTNet)
The detailed architectures of different cells in the spatial and channel routing space.
- 2023.09.28: Released the code
Please refer to meshed-memory-transformer
- Annotation. Download the annotation file annotation.zip. Extract it and put it in the project root directory.
- Feature. You can download our ResNeXt-101 features (hdf5 file) here. Access code: jcj6.
- Evaluation. Download the evaluation tools here. Access code: jcj6. Extract them and put them in the project root directory.
There are five kinds of keys in our .hdf5 file:
- `['%d_features' % image_id]`: region features, `(N_regions, feature_dim)`
- `['%d_boxes' % image_id]`: bounding boxes of the region features, `(N_regions, 4)`
- `['%d_size' % image_id]`: size of the original image (for normalizing bounding boxes), `(2,)`
- `['%d_grids' % image_id]`: grid features, `(N_grids, feature_dim)`
- `['%d_mask' % image_id]`: geometric alignment graph, `(N_regions, N_grids)`
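As a minimal sketch, the entries for one image can be read with `h5py` as follows. The file path, `image_id`, and the helper name `load_image_feats` are placeholders; the actual shapes depend on your extraction settings.

```python
import h5py

# Hypothetical helper: read all five entries for one image_id
# from the feature file described above.
def load_image_feats(h5_path, image_id):
    with h5py.File(h5_path, "r") as f:
        feats = f["%d_features" % image_id][()]  # (N_regions, feature_dim)
        boxes = f["%d_boxes" % image_id][()]     # (N_regions, 4)
        size = f["%d_size" % image_id][()]       # (2,)
        grids = f["%d_grids" % image_id][()]     # (N_grids, feature_dim)
        mask = f["%d_mask" % image_id][()]       # (N_regions, N_grids)
    return feats, boxes, size, grids, mask
```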
For feature extraction, please follow here
Training:

```shell
python train.py --exp_name DTNet --batch_size 50 --rl_batch_size 100 --workers 4 --head 8 --warmup 10000 --features_path /home/data/coco_grid_feats2.hdf5 --annotation /home/data/m2_annotations --logs_folder tensorboard_logs
```

Evaluation:

```shell
python eval.py --batch_size 50 --exp_name DTNet --features_path /home/data/coco_grid_feats2.hdf5 --annotation /home/data/m2_annotations --ckpt_path your_model_path
```
Comparisons with SOTAs on the Karpathy test split.
Examples of captions generated by Transformer and DTNet.
Images and the corresponding number of passed cells.
- Thanks to the meshed-memory-transformer.
- Thanks to the amazing work of grid-feats-vqa.
@ARTICLE{ma2024image,
author={Ma, Yiwei and Ji, Jiayi and Sun, Xiaoshuai and Zhou, Yiyi and Hong, Xiaopeng and Wu, Yongjian and Ji, Rongrong},
journal={IEEE Transactions on Neural Networks and Learning Systems},
title={Image Captioning via Dynamic Path Customization},
year={2024},
volume={},
number={},
pages={1-15},
keywords={Routing;Visualization;Transformers;Adaptation models;Task analysis;Feature extraction;Semantics;Dynamic network;image captioning;input-sensitive;transformer},
doi={10.1109/TNNLS.2024.3409354}}
