
SVAG Evaluation Toolkit

This is the official evaluation code for the Spatial-Temporal Video Action Grounding (SVAG) task.

Task definition

Given a video and a natural language query, the task requires detecting and tracking all referents that satisfy the query, together with the temporal moments most relevant to each of them.

Evaluation

The SVAG evaluation protocol consists of two complementary components, spatial and temporal evaluation, designed to assess both the spatial accuracy and the temporal consistency of visual grounding models.

Spatial evaluation

Based on the TrackEval repository.

We use HOTA as the evaluation metric.

Temporal evaluation

Based on standalone_eval from the FlashVTG repository.

We use mIoU, R1@X, R5@X, and R10@X as the evaluation metrics.

Furthermore, we introduce m-HIoU as the metric to rank submissions on the competition server. It is the average of HOTA and mIoU.
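
In other words, the ranking score is the arithmetic mean of the two component scores. A minimal sketch (the function name is illustrative, not part of the toolkit):

def m_hiou(hota, miou):
    # m-HIoU is the average of HOTA and mIoU, both given on the same scale.
    return (hota + miou) / 2.0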

Preparation

Clone this repository.

git clone https://github.com/Shuaicong97/SVAGEval.git
cd SVAGEval 

Install the required packages. Python 3.10.16 can be used for reproducibility.

pip install -r requirements.txt

Running the code

When ground truth for the test sets is known, the code can be run with one of the following commands:

cd scripts 
sh run.sh

or

cd scripts 
python run.py

Ensure the paths and filenames are correctly specified in run.sh or run.py. The final evaluation output will be written into combined_result_mean.json, as defined by OUTPUT_FILE in run.sh or by final_result in run.py.

Optional: add arguments in spatial_eval/evaluate.sh to change the number of cores used by run_mot_challenge.py:

python3 ../TrackEval/scripts/run_mot_challenge.py \
...
--USE_PARALLEL True \
--NUM_PARALLEL_CORES 2 \
...

Format

The ground truth file is in JSON format. It should have the following structure:

{
    "datasets": [dataset]
}

dataset{
    "name": string,
    "queries": [query]
}

query{
    "query_id": int,
    "query": string
    "video_id": int,
    "video_name": string,
    "video_length": int,
    "tracks": [track]
}

track{
    "track_id": int,
    "spatial": [[x,y,width,height] or None],
    "temporal": [[start,end]]
}
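
As an illustration of the ground truth format, here is a minimal sanity check (a hypothetical helper, not part of this toolkit) that verifies each track's spatial list covers every frame of the video:

import json

def check_ground_truth(path):
    # Load a ground truth file in the format described above.
    with open(path) as f:
        data = json.load(f)
    for dataset in data["datasets"]:
        for query in dataset["queries"]:
            for track in query["tracks"]:
                # One spatial entry (a box or null) is expected per video frame.
                assert len(track["spatial"]) == query["video_length"]
                # Each temporal ground truth entry is a [start, end] pair.
                for start, end in track["temporal"]:
                    assert start <= end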

The prediction file is in JSON format. It should have the following structure:

{
    "datasets": [dataset]
}

dataset{
    "name": string,
    "queries": [query]
}

query{
    "query_id": int,
    "query": string
    "video_id": int,
    "video_name": string,
    "video_length": int,
    "tracks": [track]
}

track{
    "track_id": int,
    "spatial": [[x,y,width,height] or None],
    "temporal": [[start,end,score] or None]
}

The prediction file must combine all three subsets (OVIS, MOT17, and MOT20). Below is a concrete example of a prediction file:

{
    "datasets": [{
        "name": "OVIS",
        "queries": [{
            "query_id": 5113,
            "query": "The giraffe bends its neck around another giraffe",
            "video_id": 3,
            "video_name": "fb4a7958",
            "video_length": 79,
            "tracks": [{
                "track_id": 881,
                "spatial": [
                    null,
                    [667.5802612304688,582.8107299804688,753.3036499023438,308.951904296875],
                    [645.5720825195312,582.27880859375,729.5850219726562,310.3974609375],
                    [638.4644775390625,594.1441650390625,681.0743408203125,299.79638671875],
                    ...
                ],
                "temporal":[
                    [10.0,74.0,0.33719998598098755],
                    [8.0,50.0,0.32989999651908875],
                    [12.0,32.0,0.29510000348091125],
                    ...
                ]
            }, ...] 
        }, ...]
    }, ...]
}
Field descriptions:

name: string. Dataset name; must be one of OVIS, MOT17, or MOT20.
track_id: int. Unique track id for each instance.
spatial: list(list or null). Bounding box information. The list length equals the length of the video. Each value is either a box [x,y,width,height] or null (no object in this frame).
temporal (in ground truth): list(list). Temporal ground truth. Each sublist contains two elements, [start,end].
temporal (in prediction): list(list or null). Moment retrieval predictions. Each value is either a moment prediction [start,end,score] or null (no moment retrieval predictions).
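
To make the prediction format concrete, the following sketch assembles a minimal dummy submission covering all three subsets and writes it to submission.json. All ids and values here are placeholders, not real predictions:

import json

# Assemble a minimal dummy prediction covering all three subsets.
prediction = {"datasets": []}
for name in ["OVIS", "MOT17", "MOT20"]:
    prediction["datasets"].append({
        "name": name,
        "queries": [{
            "query_id": 0,                  # placeholder id
            "query": "example query",
            "video_id": 0,
            "video_name": "example_video",
            "video_length": 2,
            "tracks": [{
                "track_id": 0,
                # One entry per frame: a box [x,y,width,height] or null.
                "spatial": [None, [10.0, 20.0, 30.0, 40.0]],
                # Moment predictions [start,end,score], or null if there are none.
                "temporal": [[0.0, 1.0, 0.5]],
            }],
        }],
    })

with open("submission.json", "w") as f:
    json.dump(prediction, f)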

Codabench submission

To submit your results to Codabench for evaluation, please follow these steps:

  1. Save your predictions in a file named submission.json, formatted as described above for the prediction file.
  2. Compress the submission.json file into a ZIP archive named submission.zip (a minimal zipping sketch follows after these steps).
  3. Upload the submission.zip file to the competition server on Codabench for evaluation.

Note: Make sure the zip archive contains only the submission.json file at the root level (not inside a subfolder).
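
One way to produce the archive for steps 1 and 2, using only the Python standard library (a minimal sketch, assuming submission.json is in the current directory):

import zipfile

# Put submission.json at the root of submission.zip, with no subfolder.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("submission.json", arcname="submission.json")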

Evaluate on your own custom benchmark

If you would like to evaluate performance without access to the ground truth of the official test set, you can create a custom benchmark in two ways:

  1. Split the provided training set into a new training subset and a held-out test/validation subset. This allows you to estimate performance using known ground truth from the original data (a minimal splitting sketch follows after this list).
  2. Use your own dataset, independent of the provided data. As long as your data is converted into the required ground truth and prediction formats, you can reuse the existing evaluation pipeline. See Format for more details on the required format.
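
A minimal sketch of option 1 (a hypothetical helper, not part of this toolkit) that splits each dataset's queries into new training and validation ground truth dictionaries:

import json
import random

def split_ground_truth(path, val_ratio=0.2, seed=0):
    # Query-level split of a ground truth file in the format described above.
    with open(path) as f:
        data = json.load(f)
    rng = random.Random(seed)
    train, val = {"datasets": []}, {"datasets": []}
    for dataset in data["datasets"]:
        queries = list(dataset["queries"])
        rng.shuffle(queries)
        cut = int(len(queries) * (1 - val_ratio))
        train["datasets"].append({"name": dataset["name"], "queries": queries[:cut]})
        val["datasets"].append({"name": dataset["name"], "queries": queries[cut:]})
    return train, val

The two dictionaries can then be written back to JSON and used as the ground truth and evaluation split for the existing pipeline.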

License

SVAGEval is released under the MIT License.

Contact

If you encounter any problems with the code, feel free to post an issue. Please contact Shuaicong Wu ([email protected]) or Tanveer Hannan ([email protected]). If anything is unclear or hard to use, please leave a comment either via email or as an issue. We would love to help.

Acknowledgement

We refer to the repositories TrackEval and FlashVTG. Thanks for their wonderful work.

If you're using SVAGEval in your research or applications, please cite using this BibTeX:

@misc{hannan2025svagbenchlargescalebenchmarkmultiinstance,
  title={SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding},
  author={Tanveer Hannan and Shuaicong Wu and Mark Weber and Suprosanna Shit and Jindong Gu and Rajat Koner and Aljoša Ošep and Laura Leal-Taixé and Thomas Seidl},
  year={2025},
  eprint={2510.13016},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.13016},
}
