Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality

Kunal Pratap Singh*, Ali Garjani*, Rishubh Singh, Muhammad Uzair Khattak, Efe Tarhan, Jason Toskov, Andrei Atanov, Oğuzhan Fatih Kar, Amir Zamir

Website | Paper | Datasets | BibTeX

Multimodality offers a natural self-supervised signal: a model can learn by predicting one sensor modality from another. Test-Space Training (TST) studies this in a controlled sandbox, where a device collects unlabeled multimodal data in a test environment, pre-trains on it, and is evaluated in the same space. This setup lets us ask how far multimodal self-supervision can go in producing specialist models for a known deployment space, and how this compares to internet-scale generalist pre-training.

Usage

This repository is built upon the 4M codebase. To get started, follow the installation steps below.

Installation

Clone this repository and navigate to the root directory:

git clone https://github.com/epfl-vilab/tst-mm-vision
cd tst-mm-vision

Create a new conda environment, then install the package and its dependencies:

conda create -n tst python=3.10 -y
conda activate tst
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Verify that CUDA is available in PyTorch by running the following in a Python shell:

# Run in Python shell
import torch
print(torch.cuda.is_available())  # Should print True

If CUDA is not available, consider re-installing PyTorch following the official installation instructions. Likewise, if you want to install xFormers (optional, for faster tokenizers), follow their README to ensure that the CUDA version is correct.
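For a slightly more detailed check than the one-liner above, the following sketch reports whether PyTorch is installed, which CUDA version it was built against, and how many GPUs it can see. The function name `cuda_report` is made up for this illustration and is not part of the repository.

```python
import importlib.util


def cuda_report() -> dict:
    """Summarize the PyTorch/CUDA setup without raising if torch is missing."""
    report = {"torch_installed": False, "cuda_available": False, "gpu_count": 0}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

A `cuda_build` of `None` usually points to a CPU-only PyTorch wheel, which is the most common reason for `cuda_available` being `False` on a machine that does have a GPU.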

Dataset

We evaluate TST on test spaces drawn from three datasets: ScanNet++, ProcTHOR, and Replica. To download the multimodal pre-training datasets and the transfer sets, run:

python tools/download_datasets.py

By default, this command downloads all three datasets. To download a single dataset, pass its name with --dataset:

python tools/download_datasets.py --dataset <dataset_name> # can be one of 'procthor', 'replica' or 'scannet++'

The downloaded ScanNet++ dataset does not contain the RGB images or the semantic segmentation labels. To obtain those, please follow the instructions here.
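To fetch two of the three datasets in one go, the per-dataset command can be wrapped in a short script. This is a minimal sketch that only relies on the `--dataset` flag and the dataset names documented above; the helper names and the file name `download_subset.py` are hypothetical, and the script is assumed to be run from the repository root.

```python
import subprocess
import sys

# Dataset names accepted by --dataset, as listed in this README.
VALID_DATASETS = ("procthor", "replica", "scannet++")


def build_download_command(dataset: str) -> list[str]:
    """Return the argv for downloading a single dataset."""
    if dataset not in VALID_DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {VALID_DATASETS}")
    return [sys.executable, "tools/download_datasets.py", "--dataset", dataset]


def download(datasets: list[str]) -> None:
    """Download each requested dataset in turn, stopping at the first failure."""
    for name in datasets:
        subprocess.run(build_download_command(name), check=True)


if __name__ == "__main__":
    # Example: python download_subset.py procthor replica
    download(sys.argv[1:])
```

Validating the name before spawning the subprocess surfaces typos immediately instead of partway through a long download.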

Pre-Training & Adaptation

To pre-train the TST-MM model from scratch, run the command below, which also sets optional Weights & Biases logging arguments:

# --config: path to the pre-training config for each dataset
# --wandb_run_name: optional; if not specified, the run name is set automatically
# --output_dir: directory for model checkpoints and logs; if not specified, an `outputs` folder is created
OMP_NUM_THREADS=1 torchrun --nproc_per_node=8 run_training_4m.py \
--config cfgs/<test-space-name>/main/<config>.yaml \
--wandb_entity <wandb-entity-name> \
--wandb_project <wandb-project-name> \
--wandb_run_name <wandb-run-name> \
--output_dir <path-to-save-outputs>

Note: Adjust the --nproc_per_node parameter based on the number of GPUs available on your system. For example, if you have 4 GPUs, set it to 4.
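If you launch training from a script, the GPU count can be filled in automatically. The sketch below assembles the torchrun command shown above from the number of GPUs PyTorch reports. It uses only the flags listed in this README; the function names `detect_gpu_count` and `build_pretrain_command` are hypothetical and not part of the repository.

```python
import shlex


def detect_gpu_count() -> int:
    """Return the number of CUDA devices PyTorch can see (0 if torch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def build_pretrain_command(config, num_gpus, wandb_entity=None, wandb_project=None,
                           wandb_run_name=None, output_dir=None):
    """Assemble the torchrun invocation as an argv list."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    cmd = ["torchrun", f"--nproc_per_node={num_gpus}", "run_training_4m.py",
           "--config", config]
    # Optional arguments are appended only when a value is given.
    optional = {
        "--wandb_entity": wandb_entity,
        "--wandb_project": wandb_project,
        "--wandb_run_name": wandb_run_name,
        "--output_dir": output_dir,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, value]
    return cmd


if __name__ == "__main__":
    gpus = detect_gpu_count()
    if gpus:
        cmd = build_pretrain_command("cfgs/<test-space-name>/main/<config>.yaml", gpus)
        # Print the command for inspection instead of running it directly.
        print("OMP_NUM_THREADS=1 " + shlex.join(cmd))
    else:
        print("No CUDA devices detected")
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; replace the placeholder config path before launching a real job.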

For finer control, edit the YAML file passed to --config. It contains the hyperparameters and training settings, which you can customize as needed.

After pre-training, the weights will be saved in the directory specified in --output_dir. These weights can then be used for fine-tuning on downstream tasks.
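When passing a checkpoint to a fine-tuning run, it can help to locate the most recent one programmatically. The sketch below is only an illustration: this README does not state how checkpoint files are named, so the suffixes in `CHECKPOINT_SUFFIXES` are assumptions to adjust, and `latest_checkpoint` is a hypothetical helper rather than part of the repository.

```python
from pathlib import Path

# Assumed checkpoint suffixes; adjust to whatever the training script writes.
CHECKPOINT_SUFFIXES = (".pth", ".pt", ".safetensors")


def latest_checkpoint(output_dir):
    """Return the most recently modified checkpoint under output_dir, or None."""
    candidates = [
        p for p in Path(output_dir).rglob("*")
        if p.is_file() and p.suffix in CHECKPOINT_SUFFIXES
    ]
    # max() with default=None handles the case where nothing has been saved yet.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    # "outputs" is the default folder mentioned above when --output_dir is omitted.
    print(latest_checkpoint("outputs"))
```

Sorting by modification time avoids depending on any particular file-naming scheme, at the cost of being misled if older checkpoints are touched or copied later.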

We provide the pre-trained TST-MM and TST-MM (adapted) weights, as well as the pre-training configurations, in the Resources section below.

Fine-tuning

To fine-tune the pre-trained models for semantic segmentation, see the instructions here, located in the segmentation directory.
To fine-tune the pre-trained models for captioning, see the instructions here, located in the captioning directory.

Resources

We provide pre-trained model weights, their corresponding configuration files, and multimodal pre-training datasets for each test space dataset.

Test Space	Datasets	Pre-Training Config	Pre-Trained Weights	Adaptation Config	Adaptation Weights
ScanNet++	Datasets	Config	Checkpoint	Config	Checkpoint
ProcTHOR	Datasets	Config	Checkpoint	Config	Checkpoint
Replica	Datasets	Config	Checkpoint	Config	Checkpoint

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

If you find our work helpful, please consider citing it:

@article{singh2026tst,
    title={Multimodality as Supervision: Self-Supervised Specialization to the Test Environment via Multimodality},
    author={Kunal Pratap Singh and Ali Garjani and Rishubh Singh and Muhammad Uzair Khattak and Efe Tarhan and Jason Toskov and Andrei Atanov and O{\u{g}}uzhan Fatih Kar and Amir Zamir},
    journal={ICLR},
    year={2026}
}
