This is the official evaluation code for the Spatial-Temporal Video Action Grounding (SVAG) task.
Given a video and a natural language query, our task requires detecting and tracking all referents that satisfy the query, along with their corresponding most relevant moments.
The SVAG evaluation protocol consists of two complementary components: Spatial and Temporal evaluation, designed to assess both spatial accuracy and temporal consistency of visual grounding models.
The spatial evaluation is based on the TrackEval repository. We use HOTA as the evaluation metric.
The temporal evaluation is based on the standalone_eval from the FlashVTG repository. We use mIoU, R1@X, R5@X, and R10@X as the evaluation metrics.
Furthermore, we introduce m-HIoU as the metric to rank submissions on the competition server. It is the average of HOTA and mIoU.
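For illustration, m-HIoU is just the arithmetic mean of the two scores. A minimal sketch (the function name is ours, not part of the evaluation code):

```python
def m_hiou(hota: float, miou: float) -> float:
    """m-HIoU as defined above: the average of HOTA (spatial) and mIoU (temporal).

    Both scores are expected on the same scale, e.g. in [0, 1] or in percent.
    """
    return (hota + miou) / 2.0


# Example: HOTA = 0.42, mIoU = 0.36 -> m-HIoU = 0.39
print(m_hiou(0.42, 0.36))
```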
Clone this repository:
```bash
git clone https://github.com/Shuaicong97/SVAGEval.git
cd SVAGEval
```
Install the required packages. Python 3.10.16 can be used to reproduce our environment:
```bash
pip install -r requirements.txt
```
When ground truth for the test sets is known, the code can be run with one of the following commands:
```bash
cd scripts
sh run.sh
```
or:
```bash
cd scripts
python run.py
```
Ensure the paths and filenames are correctly specified in `run.sh` or `run.py`. The final evaluation output will be written to `combined_result_mean.json`, as defined by `OUTPUT_FILE` in `run.sh` or by `final_result` in `run.py`.
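As a quick sanity check after a run, you can load the combined output file. This is a minimal sketch assuming the file is a flat JSON object mapping metric names to values; the exact keys depend on the evaluation code:

```python
import json

# Path as configured via OUTPUT_FILE in run.sh or final_result in run.py.
with open("combined_result_mean.json") as f:
    results = json.load(f)

# Print whatever metrics the file contains (e.g. HOTA, mIoU, m-HIoU).
for metric, value in results.items():
    print(f"{metric}: {value}")
```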
Optional: add arguments in `spatial_eval/evaluate.sh` to change the number of parallel cores used by `run_mot_challenge.py`:
```bash
python3 ../TrackEval/scripts/run_mot_challenge.py \
    ...
    --USE_PARALLEL True \
    --NUM_PARALLEL_CORES 2 \
    ...
```
The ground truth file is in JSON format and should have the following structure:
```
{
    "datasets": [dataset]
}

dataset{
    "name": string,
    "queries": [query]
}

query{
    "query_id": int,
    "query": string,
    "video_id": int,
    "video_name": string,
    "video_length": int,
    "tracks": [track]
}

track{
    "track_id": int,
    "spatial": [[x,y,width,height] or None],
    "temporal": [[start,end]]
}
```
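For illustration, a minimal ground-truth file following this schema could be assembled as below; all IDs, names, and values are invented:

```python
import json

track = {
    "track_id": 1,
    # One entry per frame: a [x, y, width, height] box, or None where the object is absent.
    # The list length must equal "video_length".
    "spatial": [None, [10.0, 20.0, 50.0, 80.0], [12.0, 21.0, 49.0, 78.0]],
    # Ground-truth moments as [start, end] pairs.
    "temporal": [[1.0, 2.0]],
}

query = {
    "query_id": 1,
    "query": "The person walks to the left",
    "video_id": 1,
    "video_name": "example_video",
    "video_length": 3,
    "tracks": [track],
}

ground_truth = {"datasets": [{"name": "MOT17", "queries": [query]}]}

with open("ground_truth.json", "w") as f:
    json.dump(ground_truth, f, indent=2)
```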
The prediction file is in JSON format and should have the following structure:
```
{
    "datasets": [dataset]
}

dataset{
    "name": string,
    "queries": [query]
}

query{
    "query_id": int,
    "query": string,
    "video_id": int,
    "video_name": string,
    "video_length": int,
    "tracks": [track]
}

track{
    "track_id": int,
    "spatial": [[x,y,width,height] or None],
    "temporal": [[start,end,score] or None]
}
```
The prediction file must combine all three subsets (OVIS, MOT17, and MOT20). Below is a concrete example of a prediction file:
```json
{
    "datasets": [{
        "name": "OVIS",
        "queries": [{
            "query_id": 5113,
            "query": "The giraffe bends its neck around another giraffe",
            "video_id": 3,
            "video_name": "fb4a7958",
            "video_length": 79,
            "tracks": [{
                "track_id": 881,
                "spatial": [
                    null,
                    [667.5802612304688,582.8107299804688,753.3036499023438,308.951904296875],
                    [645.5720825195312,582.27880859375,729.5850219726562,310.3974609375],
                    [638.4644775390625,594.1441650390625,681.0743408203125,299.79638671875],
                    ...
                ],
                "temporal": [
                    [10.0,74.0,0.33719998598098755],
                    [8.0,50.0,0.32989999651908875],
                    [12.0,32.0,0.29510000348091125],
                    ...
                ]
            }, ...]
        }, ...]
    }, ...]
}
```
| entry | description |
|---|---|
| `name` | string, dataset name. Must be one of OVIS, MOT17, or MOT20 |
| `track_id` | int, unique track id for each instance |
| `spatial` | list(list or null), bounding box information. The list length equals the length of the video. Each value is either a specific box [x,y,width,height] or null (no object in this frame) |
| `temporal` (ground truth) | list(list), temporal ground truth. Each sublist contains 2 elements: [start,end] |
| `temporal` (prediction) | list(list or null), moment retrieval predictions. Each value is either a specific moment prediction [start,end,score] or null (no moment retrieval prediction) |
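Before submitting, it can help to sanity-check a prediction file against these constraints. The checks below are our own suggestions, not part of the official evaluation code:

```python
import json

ALLOWED_DATASETS = {"OVIS", "MOT17", "MOT20"}


def check_predictions(path: str) -> None:
    with open(path) as f:
        data = json.load(f)

    # The prediction file must combine all three subsets.
    names = {d["name"] for d in data["datasets"]}
    assert names == ALLOWED_DATASETS, f"expected {ALLOWED_DATASETS}, got {names}"

    for dataset in data["datasets"]:
        for query in dataset["queries"]:
            for track in query["tracks"]:
                # The spatial list must have one entry (box or null) per frame.
                assert len(track["spatial"]) == query["video_length"], (
                    f"query {query['query_id']}: spatial length != video_length"
                )
                # Each temporal entry is either null or a [start, end, score] triplet.
                for moment in track["temporal"]:
                    assert moment is None or len(moment) == 3, (
                        f"query {query['query_id']}: moment must be [start, end, score] or null"
                    )


check_predictions("submission.json")
```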
To submit your results to Codabench for evaluation, please follow these steps:
- Save your predictions in a file named `submission.json`, formatted as described above for the prediction file.
- Compress the `submission.json` file into a ZIP archive named `submission.zip` (see the packaging sketch after the note below).
- Upload the `submission.zip` file to the competition server on Codabench for evaluation.

Note: Make sure the zip archive contains only the `submission.json` file at the root level (not inside a subfolder).
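For the compression step, one way to produce a correctly structured archive, with the prediction file at the root and nothing else, is:

```python
import zipfile

# Write submission.json at the archive root; do not zip a parent folder.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("submission.json", arcname="submission.json")
```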
If you would like to evaluate performance without access to the ground truth of the official test set, you can create a custom benchmark in two ways:
- Split the provided training set into a new training subset and a held-out test/validation subset. This allows you to estimate performance using known ground truth from the original data (see the sketch after this list).
- Use your own dataset, independent of the provided data. As long as your data is converted into the required ground truth and prediction formats, you can reuse the existing evaluation pipeline. See Format for more details on the required format.
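For the first option, one simple way to carve out a held-out subset is to split the queries of each dataset, e.g. 90/10. The file names and split ratio below are only examples:

```python
import json
import random

random.seed(0)

# "train_ground_truth.json" is a placeholder for the provided training annotations.
with open("train_ground_truth.json") as f:
    data = json.load(f)

train, heldout = {"datasets": []}, {"datasets": []}
for dataset in data["datasets"]:
    queries = list(dataset["queries"])
    random.shuffle(queries)
    cut = int(0.9 * len(queries))
    train["datasets"].append({"name": dataset["name"], "queries": queries[:cut]})
    heldout["datasets"].append({"name": dataset["name"], "queries": queries[cut:]})

with open("custom_train.json", "w") as f:
    json.dump(train, f)
with open("custom_val.json", "w") as f:
    json.dump(heldout, f)
```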
SVAGEval is released under the MIT License.
If you encounter any problems with the code, or if anything is unclear or hard to use, feel free to post an issue or contact Shuaicong Wu ([email protected]) or Tanveer Hannan ([email protected]). We would love to help.
Our code builds on the TrackEval and FlashVTG repositories. Thanks for their wonderful work.
If you're using SVAGEval in your research or applications, please cite using this BibTeX:
```bibtex
@misc{hannan2025svagbenchlargescalebenchmarkmultiinstance,
      title={SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding},
      author={Tanveer Hannan and Shuaicong Wu and Mark Weber and Suprosanna Shit and Jindong Gu and Rajat Koner and Aljoša Ošep and Laura Leal-Taixé and Thomas Seidl},
      year={2025},
      eprint={2510.13016},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13016},
}
```