EPFL-VILAB/tst-mm-vision

Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality

Kunal Pratap Singh*, Ali Garjani*, Rishubh Singh, Muhammad Uzair Khattak, Efe Tarhan, Jason Toskov, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir

Website | Paper | Datasets | BibTeX

TST teaser figure

Multimodality offers a natural self-supervised signal: a model can learn by predicting one sensor modality from another. Test-Space Training (TST) studies this in a controlled sandbox, where a device collects unlabeled multimodal data in a test environment, pre-trains on it, and is evaluated in the same space. This setup lets us ask how far multimodal self-supervision can go in producing specialist models for a known deployment space, and how this compares to internet-scale generalist pre-training.

Usage

This repository is built upon the 4M codebase. To get started, follow the installation steps below.

Installation

  1. Clone this repository and navigate to the root directory:
git clone https://github.com/epfl-vilab/tst-mm-vision
cd tst-mm-vision
  2. Create a new conda environment, then install the package and its dependencies:
conda create -n tst python=3.10 -y
conda activate tst
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Verify that CUDA is available in PyTorch by running the following in a Python shell:
# Run in Python shell
import torch
print(torch.cuda.is_available())  # Should return True

If CUDA is not available, consider re-installing PyTorch following the official installation instructions. Likewise, if you want to install xFormers (optional, for faster tokenizers), follow their README to ensure that the CUDA version is correct.
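
The environment checks described above can be collected into one small script. This is a minimal sketch and not part of the repository: the function name `check_environment` is hypothetical, and it only reports whether the packages are importable and whether PyTorch sees a CUDA device, without validating versions.

```python
import importlib.util


def check_environment():
    """Report whether PyTorch, CUDA, and (optionally) xFormers are usable."""
    report = {"torch": False, "cuda": False, "xformers": False}

    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = True
        report["cuda"] = bool(torch.cuda.is_available())

    # xFormers is optional, so a missing package is not an error here.
    report["xformers"] = importlib.util.find_spec("xformers") is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If `cuda` is reported as missing while `torch` is present, the installed PyTorch build most likely does not match the local CUDA setup.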

Dataset

We evaluate TST on the test spaces acquired using the following datasets: ScanNet++, ProcTHOR, and Replica. To download the multimodal pre-training datasets as well as the transfer sets, execute the following command:

python tools/download_datasets.py

By default, this command downloads all three datasets. To download a single dataset, run:

python tools/download_datasets.py --dataset <dataset_name> # can be one of 'procthor', 'replica' or 'scannet++'

The downloaded ScanNet++ dataset does not contain the RGB images or the semantic segmentation labels. To obtain them, please follow the instructions here.
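
To download a subset of the datasets, for example two of the three, the single-dataset command can be called in a loop. The following is a minimal sketch under the assumption that it is run from the repository root. The helper names `build_download_command` and `download` are hypothetical, and only the `--dataset` flag and the three dataset names listed above are taken from this README.

```python
import subprocess
import sys

# Dataset names accepted by --dataset, as listed in this README.
DATASETS = ("procthor", "replica", "scannet++")


def build_download_command(dataset=None):
    """Return the argv list for tools/download_datasets.py."""
    cmd = [sys.executable, "tools/download_datasets.py"]
    if dataset is not None:
        if dataset not in DATASETS:
            raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
        cmd += ["--dataset", dataset]
    return cmd


def download(datasets):
    """Download each requested dataset in turn, stopping at the first failure."""
    for name in datasets:
        subprocess.run(build_download_command(name), check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for name in ("procthor", "replica"):
        print(" ".join(build_download_command(name)))
```

Calling `download(["procthor", "replica"])` runs the two commands that the dry run prints.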

Pre-Training & Adaptation

To pre-train the TST-MM model from scratch, run the command below, which also includes optional Weights & Biases logging arguments:

# --config: path to the pre-training config for each dataset
# --wandb_run_name: if not specified, the run name is set automatically
# --output_dir: where model checkpoints and logs are saved; if not specified, an `outputs` folder is created
OMP_NUM_THREADS=1 torchrun --nproc_per_node=8 run_training_4m.py \
--config cfgs/<test-space-name>/main/<config>.yaml \
--wandb_entity <wandb-entity-name> \
--wandb_project <wandb-project-name> \
--wandb_run_name <wandb-run-name> \
--output_dir <path-to-save-outputs>

Note: Adjust the --nproc_per_node parameter based on the number of GPUs available on your system. For example, if you have 4 GPUs, set it to 4.
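
The GPU count can also be filled in programmatically when launching from a script. The sketch below assembles the same `torchrun` invocation as an argument list. The helper `build_pretrain_command` and its `extra_args` parameter are hypothetical and not part of the repository; the Weights & Biases flags are left out of the signature for brevity and can be passed through `extra_args`.

```python
import shlex


def build_pretrain_command(config, num_gpus, output_dir=None, extra_args=()):
    """Assemble the torchrun invocation for run_training_4m.py as an argv list."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    cmd = [
        "torchrun",
        f"--nproc_per_node={num_gpus}",
        "run_training_4m.py",
        "--config", config,
    ]
    # Additional flags such as --wandb_entity are appended unchanged.
    cmd += list(extra_args)
    # --output_dir is optional; the README says an `outputs` folder is used otherwise.
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    return cmd


if __name__ == "__main__":
    # The config path is a placeholder; substitute a real file under cfgs/.
    cmd = build_pretrain_command("cfgs/<test-space-name>/main/<config>.yaml", num_gpus=4)
    print("OMP_NUM_THREADS=1 " + shlex.join(cmd))
```

On a machine with PyTorch installed, `num_gpus` could be taken from `torch.cuda.device_count()` instead of being hard-coded.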

For more detailed configuration options, you can modify the corresponding YAML file specified in the --config parameter. The YAML files contain various hyperparameters and training settings that you can customize according to your needs.

After pre-training, the weights will be saved in the directory specified in --output_dir. These weights can then be used for fine-tuning on downstream tasks.
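
Before fine-tuning, it can be handy to locate the newest checkpoint in the output directory. The sketch below is an assumption-laden example: this README does not state how checkpoint files are named, so the helper `latest_checkpoint` (a hypothetical name) simply searches for common weight-file extensions and picks the most recently modified match.

```python
from pathlib import Path


def latest_checkpoint(output_dir="outputs", patterns=("*.pth", "*.pt", "*.safetensors")):
    """Return the most recently modified checkpoint under output_dir, or None."""
    root = Path(output_dir)
    # The extensions are guesses; adjust `patterns` to the files the training run writes.
    candidates = [p for pattern in patterns for p in root.rglob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    print(latest_checkpoint("outputs"))
```

The function returns `None` when the directory is missing or contains no matching files, so the caller can decide whether to start from scratch.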

We provide the pre-trained TST-MM and TST-MM (adapted) weights, as well as the corresponding pre-training configurations, in the Resources section.

Fine-tuning

  • To fine-tune the pre-trained models for semantic segmentation, see the instructions here, located in the segmentation directory.
  • To fine-tune the pre-trained models for captioning, see the instructions here, located in the captioning directory.

Resources

We provide pre-trained model weights, their corresponding configuration files, and multimodal pre-training datasets for each test space dataset.

| Test Space | Datasets | Pre-Training Config | Pre-Trained Weights | Adaptation Config | Adaptation Weights |
|---|---|---|---|---|---|
| ScanNet++ | Datasets | Config | Checkpoint | Config | Checkpoint |
| ProcTHOR | Datasets | Config | Checkpoint | Config | Checkpoint |
| Replica | Datasets | Config | Checkpoint | Config | Checkpoint |

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

If you find our work helpful, please consider citing it:

@article{singh2026tst,
    title={Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality},
    author={Kunal Pratap Singh and Ali Garjani and Rishubh Singh and Muhammad Uzair Khattak and Efe Tarhan and Jason Toskov and Andrei Atanov and O{\u{g}}uzhan Fatih Kar and Amir Zamir},
    journal={ICLR},
    year={2026}
}
