
alphavideo.model.tubeTK

Alpha-Video edited this page May 10, 2020 · 2 revisions

Introduction

TubeTK is an end-to-end model for video multi-object tracking that is trained in a single stage. This is the official implementation of the paper "TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model". The complete training and inference scripts can be found here.

API

FUNCTION alphavideo.model.tubeTK(num_class=1, pretrain=True)

  • Build the tubeTK model for multi-object tracking.

  • Parameters:

    • num_class (int): Number of object categories the model tracks. At present, num_class=1 is usually used for pedestrian or car tracking. A pretrained model is provided only for num_class=1. By default, num_class=1.

    • pretrain (bool): Whether to load weights pretrained on MOT16. A pretrained model is provided only for num_class=1. By default, pretrain=True.

  • Input:

    • img (tensor): Input frames of the target video. The example below passes a tensor of shape (1, 3, 8, 896, 1152).

    • img_meta (list of dict): Metadata for the input frames. Each dict corresponds to one video in the batch and has the following structure:

      {'img_shape': [8, 1080, 1920],
       'value_range': 1,
       'pad_percent': [1,1]}
      

      where img_shape is the original shape of the input clip before transformations such as resizing or padding. It is used to map the predicted coordinates back to the original space. value_range is the scale the model uses to represent coordinates. For example, if value_range=2, (2, 2) is the coordinate of the bottom-right corner. pad_percent indicates the padding percentage of the input frames. For example, if an image with an original shape of (80, 100) is padded to (100, 100) for input, pad_percent should be [1, 0.8].

    • gt_tubes (list of tensor): Only needed for training. Each tensor corresponds to one video in the batch and has shape (n, 15), representing n Btubes, each expressed by 15 coordinates. The example below shows one such 15-value row.

    • gt_labels (list of tensor): Only needed for training. Each tensor corresponds to one video in the batch and has shape (n, num_class), representing n one-hot labels.

    • return_loss (bool): A flag that selects the mode. If True, the model runs in training mode and returns the losses. If False, it runs inference and returns the Btube list.
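
The img_meta and gt_labels descriptions above can be made concrete with a small sketch. It is not part of alphavideo: build_img_meta and one_hot are hypothetical helper names, and the sketch assumes that shapes are given as (height, width) and that pad_percent is ordered [width ratio, height ratio], which is one reading consistent with the (80, 100) to (100, 100) example. In real use the one-hot rows would be wrapped in a tensor before being passed as gt_labels.

```python
def build_img_meta(orig_hw, content_hw, padded_hw, num_frames, value_range=1):
    """Build one img_meta entry for a clip (hypothetical helper).

    Assumptions, not confirmed by the source:
      * every shape is (height, width);
      * pad_percent is ordered [width_ratio, height_ratio];
      * content_hw is the frame size after resizing but before padding.
    """
    orig_h, orig_w = orig_hw
    content_h, content_w = content_hw
    padded_h, padded_w = padded_hw
    return {
        # original clip shape before any resize or padding
        'img_shape': [num_frames, orig_h, orig_w],
        'value_range': value_range,
        # fraction of the padded input that holds real image content
        'pad_percent': [content_w / padded_w, content_h / padded_h],
    }


def one_hot(class_ids, num_class):
    """Return n one-hot rows of length num_class as nested lists."""
    return [[1.0 if c == k else 0.0 for k in range(num_class)]
            for c in class_ids]


# The padding example from the text: (80, 100) padded to (100, 100).
meta = build_img_meta(orig_hw=(80, 100), content_hw=(80, 100),
                      padded_hw=(100, 100), num_frames=8)
print(meta['pad_percent'])  # [1.0, 0.8]

# Two objects of the single class, as n x num_class one-hot rows.
print(one_hot([0, 0], num_class=1))  # [[1.0], [1.0]]
```

With a 1080 x 1920 clip resized directly to 896 x 1152 and no padding, the same helper gives pad_percent equal to [1.0, 1.0], matching the img_meta shown above.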

  • Output:

    • When return_loss=True: The output is a dict containing multiple losses:
        dict(
              loss_cls=loss_cls,
              loss_reg=loss_reg,
              loss_centerness=loss_centerness,
              mid_iou_loss=mid_iou_loss)
    • When return_loss=False: The output is a list of results, one element per video in the batch. Each element is itself a list: [tubes, labels, others]. tubes is a list of Btubes with shape [n, 15], in the same format as the input gt_tubes. labels is a list of labels with shape [n, num_class], in the same format as the input gt_labels. others holds some intermediate results. For details, please see the detailed training and evaluation scripts here.
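
As a rough illustration of how the two output modes might be consumed, the sketch below sums the loss dict into a single value and walks the per-video result list. The helper names total_loss, class_indices and unpack_results are hypothetical, and the sketch assumes that each loss entry is a single scalar (or scalar tensor); plain Python numbers and lists stand in for the model's tensors.

```python
def total_loss(losses):
    """Sum every entry of the loss dict into one value.

    Assumes each entry is a scalar; with scalar tensors the same
    expression yields a tensor that could be backpropagated.
    """
    return sum(losses.values())


def class_indices(one_hot_labels):
    """Convert n one-hot rows into n class indices."""
    return [max(range(len(row)), key=row.__getitem__)
            for row in one_hot_labels]


def unpack_results(results):
    """Yield (video_index, tubes, labels) for each video in the batch."""
    for video_index, (tubes, labels, _others) in enumerate(results):
        yield video_index, tubes, labels


# Stand-in for the dict returned when return_loss=True.
losses = {'loss_cls': 0.5, 'loss_reg': 0.25,
          'loss_centerness': 0.125, 'mid_iou_loss': 0.125}
print(total_loss(losses))  # 1.0

# Stand-in for the list returned when return_loss=False:
# one video, one Btube of 15 values, one one-hot label.
fake_results = [[[[0.0] * 15], [[1.0]], None]]
for index, tubes, labels in unpack_results(fake_results):
    print(index, len(tubes), class_indices(labels))  # 0 1 [0]
```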
  • Example:

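import torch
import alphavideo

# model: build TubeTK with the pretrained weights (num_class=1 by default)
model = alphavideo.model.tubeTK(pretrain=True)
print(model)

# input: a batch containing one clip, as a tensor of shape (1, 3, 8, 896, 1152)
images = torch.zeros((1, 3, 8, 896, 1152))
image_meta = [{'img_shape': [8, 1080, 1920],
               'value_range': 1,
               'pad_percent': [1, 1]}]
# ground truth: one Btube (15 values) and its one-hot label; only needed for training
gt_tubes = [torch.tensor([[0, 0, 0.1, 0.1, 3, 2, 0, 0, 0, 0, -2, 0, 0, 0, 0]])]
gt_labels = [torch.ones((1, 1))]
# inference: return_loss=False returns the Btube list instead of the losses
results = model(images, image_meta, return_loss=False, gt_tubes=gt_tubes, gt_labels=gt_labels)
print(results)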

Citation

@inproceedings{pang2020tubeTK,
  title={TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model},
  author={Pang, Bo and Li, Yizhuo and Zhang, Yifan and Li, Muchen and Lu, Cewu},
  booktitle={CVPR},
  year={2020}
}
