The object masks are provided in JSON files as RLE-encoded strings. We use pycocotools to encode/decode these masks.
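For illustration, a minimal decode/encode sketch (this assumes the stored string is the COCO-style `counts` payload and that the mask size comes from the sequence's `height` and `width` fields described below):

```python
import numpy as np
from pycocotools import mask as cocomask

def decode_mask(rle_str: str, height: int, width: int) -> np.ndarray:
    """Decode an RLE string from the annotation JSON into a binary (H, W) mask."""
    rle = {"size": [height, width], "counts": rle_str.encode("utf-8")}
    return cocomask.decode(rle)

def encode_mask(binary_mask: np.ndarray) -> str:
    """Encode a binary (H, W) mask back into an RLE string."""
    rle = cocomask.encode(np.asfortranarray(binary_mask.astype(np.uint8)))
    return rle["counts"].decode("utf-8")
```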
The JSON file contains a dictionary which is organized as follows:
sequences:
- width: <int> # image_dims
  height: <int>
  id: <int>
  seq_name: <str>
  dataset: <str> # LaSOT, BDD, ArgoVerse, HACS, ...
  fps: <int>
  all_image_paths:
  - <str>
  ...
  annotated_image_paths:
  - <str>
  ...
  neg_category_ids:
  - <int>
  ...
  not_exhaustive_category_ids:
  - <int>
  ...
  track_category_ids:
    ... # see below for details
  segmentations:
    ... # see below for details
...
categories:
- id: <int>
  name: <str>
  synset: <str>
  def: <str>
  synonyms:
  - <str>
  ...
...
split: <str> # train/val/test
- `categories`: List of all object categories in the dataset. We use the same category IDs as the LVIS dataset.
- `sequences`: List of all video sequences in the dataset. Each list entry is a dictionary with basic attributes (e.g., image size, video ID) and the mask annotations for the object tracks in that video.
- `split`: Which split (train/val/test) the annotations belong to.
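For illustration, a minimal sketch of loading one of these files and walking the top-level structure (the file path is a placeholder):

```python
import json

with open("annotations/train.json") as f:  # placeholder path
    data = json.load(f)

print(data["split"])  # 'train', 'val' or 'test'
category_names = {c["id"]: c["name"] for c in data["categories"]}

for seq in data["sequences"]:
    print(seq["seq_name"], seq["dataset"],
          f"{seq['width']}x{seq['height']} @ {seq['fps']} fps,",
          len(seq["annotated_image_paths"]), "annotated frames")
```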
The `track_category_ids` field is a dict which conveys the category ID for each object track in that sequence:
track_category_ids:
  track_id: category_id
  ...
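Since JSON forces dict keys to be strings (see the notes further below), a parsing sketch would cast the track IDs back to int:

```python
# `seq` is one entry of data["sequences"] from the loading sketch above.
track_to_category = {
    int(track_id): category_id
    for track_id, category_id in seq["track_category_ids"].items()
}
```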
The `segmentations` field is a list with one entry per annotated video frame. Each list element is a dict with track IDs as keys, and the encoded masks and other attributes as values. In the example below, '1' and '2' are the track IDs:
segmentations:
- 1: # first frame
    rle: <str>
    is_gt: <bool>
    score: <float>
    bbox: # only present in `first_frame_annotations` files
    - <int> # x coordinate
    - <int> # y coordinate
    - <int> # width
    - <int> # height
    point: # only present in `first_frame_annotations` files
    - <int> # x coordinate
    - <int> # y coordinate
  2: ...
- 1: ... # second frame
  2: ...
- 1: ... # third frame
  2: ...
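As a sketch of iterating over these annotations, assuming each entry of `segmentations` lines up with the corresponding entry of `annotated_image_paths` and reusing the hypothetical `decode_mask` helper from above:

```python
# `seq` and `decode_mask` are from the sketches above.
for img_path, frame in zip(seq["annotated_image_paths"], seq["segmentations"]):
    for track_id, ann in frame.items():
        binary_mask = decode_mask(ann["rle"], seq["height"], seq["width"])
        print(img_path, int(track_id), int(binary_mask.sum()), "mask pixels")
```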
- For the training set, we adopted a semi-automated workflow for annotating temporally dense object masks. The `is_gt` field conveys whether a given mask was annotated automatically or by a human annotator, and the `score` field conveys the confidence for an automatically annotated mask.
- For the val and test sets, all annotations were done by humans, so the `is_gt` and `score` fields can be ignored.
- For the `first_frame_annotations` files (useful for exemplar-guided tasks), there are two additional fields, `bbox` and `point`, which convey the bounding box and a random point on the object for the first frame in which it occurs (see the sketch after this list).
- In the `segmentations` and `track_category_ids` fields, track IDs are encoded as strings (the JSON file format enforces that dict keys must be strings). Remember to cast them to int when parsing the annotations.
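For example, a hypothetical sketch of collecting the exemplar `bbox` and `point` for each track from a `first_frame_annotations` file:

```python
# `seq` is one entry of data["sequences"] loaded from a
# `first_frame_annotations` file.
exemplars = {}  # track_id -> (bbox, point)
for frame in seq["segmentations"]:
    for track_id, ann in frame.items():
        tid = int(track_id)  # dict keys are strings in the JSON
        if tid not in exemplars and "bbox" in ann:
            x, y, w, h = ann["bbox"]  # top-left x/y, width, height
            px, py = ann["point"]     # a random point on the object
            exemplars[tid] = ((x, y, w, h), (px, py))
```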
For evaluating your predicted results, the code expects a single JSON file with the same format as the ground-truth format explained above. For the exemplar-guided and open-world tasks, the `track_category_ids` field is still required for parsing the prediction file, but its values are irrelevant, i.e. you can assign any class ID to the tracks.
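For instance, a sketch of filling in placeholder category IDs before evaluation (the file names are hypothetical):

```python
import json

with open("raw_predictions.json") as f:  # hypothetical file name
    predictions = json.load(f)

for seq in predictions["sequences"]:
    # The field must be present, but for the exemplar-guided and
    # open-world tasks the actual category ID values are ignored.
    seq["track_category_ids"] = {tid: 1 for tid in seq["track_category_ids"]}

with open("predictions.json", "w") as f:
    json.dump(predictions, f)
```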