Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality
Kunal Pratap Singh*, Ali Garjani*, Rishubh Singh, Muhammad Uzair Khattak, Efe Tarhan, Jason Toskov, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir
Website | Paper | Datasets | BibTeX
Multimodality offers a natural self-supervised signal: a model can learn by predicting one sensor modality from another. Test-Space Training (TST) studies this in a controlled sandbox, where a device collects unlabeled multimodal data in a test environment, pre-trains on it, and is evaluated in the same space. This setup lets us ask how far multimodal self-supervision can go in producing specialist models for a known deployment space, and how this compares to internet-scale generalist pre-training.
This repository is built upon the 4M codebase. To get started, follow the installation steps below.
- Clone this repository and navigate to the root directory:
git clone https://github.com/epfl-vilab/tst-mm-vision
cd tst-mm-vision
- Create a new conda environment, then install the package and its dependencies:
conda create -n tst python=3.10 -y
conda activate tst
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Verify that CUDA is available in PyTorch by running the following in a Python shell:
# Run in Python shell
import torch
print(torch.cuda.is_available()) # Should return True
If CUDA is not available, consider re-installing PyTorch following the official installation instructions. Likewise, if you want to install xFormers (optional, for faster tokenizers), follow their README to ensure that the CUDA version is correct.
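For a slightly more detailed check than the one-liner above, the following is a minimal sketch that reports whether PyTorch is installed, which CUDA version it was built against, and how many GPUs it sees. The helper name `describe_cuda` is hypothetical and not part of this repository.

```python
import importlib.util


def describe_cuda():
    """Return a small report on the local PyTorch/CUDA setup (hypothetical helper)."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch  # imported lazily so the check also runs without PyTorch

    report["torch_version"] = torch.__version__
    report["built_with_cuda"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in describe_cuda().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` is `None`, the installed PyTorch is likely a CPU-only build, which is a common reason for `cuda_available` being `False`.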
We evaluate TST on the test spaces acquired using the following datasets: ScanNet++, ProcTHOR, and Replica. To download the multimodal pre-training datasets as well as the transfer sets, execute the following command:
python tools/download_datasets.py
By default, this command downloads all three datasets. To download a specific dataset, run:
python tools/download_datasets.py --dataset <dataset_name> # can be one of 'procthor', 'replica' or 'scannet++'
The downloaded ScanNet++ dataset does not contain the RGB images or the semantic segmentation labels. To obtain them, please follow the instructions here.
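To download a subset of the datasets in one go, the command above can be wrapped in a short script. This is a sketch only: the helper `download_command` is hypothetical, it relies solely on the `--dataset` flag and the three dataset names listed above, and it assumes the script is run from the repository root.

```python
import shlex
import sys

# Dataset names accepted by --dataset, as listed in this README.
DATASETS = ("procthor", "replica", "scannet++")


def download_command(dataset=None, python=sys.executable):
    """Build the argument list for tools/download_datasets.py (hypothetical helper)."""
    cmd = [python, "tools/download_datasets.py"]
    if dataset is not None:
        if dataset not in DATASETS:
            raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
        cmd += ["--dataset", dataset]
    return cmd


if __name__ == "__main__":
    # Print the commands for a chosen subset. To actually run them from the
    # repository root, pass each list to subprocess.run(cmd, check=True).
    for name in ("procthor", "replica"):
        print(shlex.join(download_command(name)))
```

Validating the name before launching avoids starting a download with a misspelled dataset.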
To pre-train the TST-MM model from scratch, run the command below, which includes some additional logging arguments:
# --config: path to the pre-training config for each dataset
# --wandb_run_name: if not specified, the run name will be set automatically
# --output_dir: directory where model checkpoints and logs will be saved; if not specified, an `outputs` folder will be created
OMP_NUM_THREADS=1 torchrun --nproc_per_node=8 run_training_4m.py \
--config cfgs/<test-space-name>/main/<config>.yaml \
--wandb_entity <wandb-entity-name> \
--wandb_project <wandb-project-name> \
--wandb_run_name <wandb-run-name> \
--output_dir <path-to-save-outputs>
Note: Adjust the --nproc_per_node parameter based on the number of GPUs available on your system. For example, if you have 4 GPUs, set it to 4.
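When launching on machines with different GPU counts, it can help to assemble the command programmatically. The sketch below builds the same `torchrun` invocation from the flags shown above; the helper `torchrun_command` is hypothetical and not part of this repository, and the optional flags are simply omitted when not provided.

```python
import shlex


def torchrun_command(config, num_gpus, wandb_entity=None, wandb_project=None,
                     wandb_run_name=None, output_dir=None):
    """Build the torchrun argument list for run_training_4m.py (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    cmd = ["torchrun", f"--nproc_per_node={num_gpus}", "run_training_4m.py",
           "--config", config]
    optional = {
        "--wandb_entity": wandb_entity,
        "--wandb_project": wandb_project,
        "--wandb_run_name": wandb_run_name,
        "--output_dir": output_dir,
    }
    # Only pass flags that were explicitly set, so the script's defaults apply otherwise.
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    cmd = torchrun_command("cfgs/<test-space-name>/main/<config>.yaml", num_gpus=4)
    # OMP_NUM_THREADS=1 is an environment variable, not a flag; set it in the
    # environment of the launched process, e.g.
    # subprocess.run(cmd, env={**os.environ, "OMP_NUM_THREADS": "1"}, check=True)
    print("OMP_NUM_THREADS=1 " + shlex.join(cmd))
```

The number of GPUs could be taken from `torch.cuda.device_count()` instead of being hard-coded.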
For more detailed configuration options, you can modify the corresponding YAML file specified in the --config parameter. The YAML files contain various hyperparameters and training settings that you can customize according to your needs.
After pre-training, the weights will be saved in the directory specified in --output_dir. These weights can then be used for fine-tuning on downstream tasks.
We provide the pre-trained TST-MM and TST-MM (adapted) weights, as well as the pre-training configuration in the resources.
- For instructions on how to fine-tune the pre-trained models for the semantic segmentation task, see the instructions here, located in the segmentation directory.
- For instructions on how to fine-tune the pre-trained models for the captioning task, see the instructions here, located in the captioning directory.
We provide pre-trained model weights, their corresponding configuration files, and multimodal pre-training datasets for each test space dataset.
| Test Space | Datasets | Pre-Training Config | Pre-Trained Weights | Adaptation Config | Adaptation Weights |
|---|---|---|---|---|---|
| ScanNet++ | Datasets | Config | Checkpoint | Config | Checkpoint |
| ProcTHOR | Datasets | Config | Checkpoint | Config | Checkpoint |
| Replica | Datasets | Config | Checkpoint | Config | Checkpoint |
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
If you find our work helpful, please consider citing it:
@article{singh2026tst,
title={Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality},
author={Kunal Pratap Singh and Ali Garjani and Rishubh Singh and Muhammad Uzair Khattak and Efe Tarhan and Jason Toskov and Andrei Atanov and O{\u{g}}uzhan Fatih Kar and Amir Zamir},
journal={ICLR},
year={2026}
}
