Official implementation of the paper "Improving Multimodal Learning with Multi-Loss Gradient Modulation" (BMVC 2024).
Multimodal models often let one modality dominate training, hurting overall performance.
We propose Multi-Loss Balanced (MLB) training: a method that combines unimodal losses with adaptive gradient balancing.
Unlike prior work, MLB can both accelerate and decelerate modality learning, and naturally phases out balancing at convergence.
- Consistently outperforms state-of-the-art balancing methods across audio-video (CREMA-D, AVE, UCF) and video-optical flow (Something-Something) datasets
- Works with different backbones (ResNet, Conformer) and fusion strategies (Late, Mid, FiLM, Gated, Transformer)
- Improves both accuracy and calibration
The core idea of our Multi-Loss Balanced (MLB) method is illustrated below, in contrast with previous approaches:
- (a) Gradient Balancing Methods: estimate unimodal performance to compute coefficients (k_a, k_v), which are used to rebalance only the gradients of the main multimodal network.
- (b) Multi-Task Methods: add separate unimodal classifiers (CLS heads) to obtain better performance estimates, but use the resulting coefficients to weight only the unimodal losses.
- (c) Proposed MLB Method: combines both strategies. It uses unimodal classifiers for accurate performance estimation and applies the resulting coefficients to modulate the gradients of both the multimodal and the unimodal losses (a minimal sketch of one such training step follows this list).
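To make (c) concrete, here is a minimal PyTorch-style sketch of one training step. It is an illustration of the general idea only, not the repository's Trainer: the encoders, heads, and the way the coefficients k_a and k_v are obtained (see the next section) are placeholder assumptions.

```python
# Minimal sketch of a multi-loss balanced training step (illustrative only, not the
# repository's actual Trainer). Encoders, heads and the optimizer are assumed to be
# standard PyTorch modules; k_a and k_v are the balancing coefficients described below.
import torch
import torch.nn.functional as F


def mlb_training_step(audio_enc, video_enc, fusion_head, audio_head, video_head,
                      optimizer, audio, video, labels, k_a, k_v):
    """One step: multimodal + unimodal losses, with per-modality gradient modulation."""
    optimizer.zero_grad()

    z_a = audio_enc(audio)      # audio features
    z_v = video_enc(video)      # video features

    # Multimodal (late fusion) and unimodal classifier (CLS head) predictions.
    logits_mm = fusion_head(torch.cat([z_a, z_v], dim=-1))
    logits_a, logits_v = audio_head(z_a), video_head(z_v)

    # Multi-loss objective: every loss contributes gradients to its encoder.
    loss = (F.cross_entropy(logits_mm, labels)
            + F.cross_entropy(logits_a, labels)
            + F.cross_entropy(logits_v, labels))
    loss.backward()

    # Gradient modulation: rescale each encoder's gradients (coming from both the
    # multimodal and its unimodal loss) by the balancing coefficient. A value above 1
    # accelerates a lagging modality, below 1 decelerates a dominant one, and a value
    # of 1 leaves training untouched.
    for params, k in ((audio_enc.parameters(), k_a), (video_enc.parameters(), k_v)):
        for p in params:
            if p.grad is not None:
                p.grad.mul_(k)

    optimizer.step()
    return loss.item()
```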
The balancing coefficients are estimated as follows:
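The exact estimator is given in the paper; the sketch below only illustrates the general shape such a rule can take. The performance estimate s_m, the ratio ρ_m, and the strength α are assumptions in the spirit of standard gradient-balancing rules (with α presumably corresponding to the `--alpha` training argument):

```latex
% Illustrative sketch only, not the paper's exact formula.
% s_m: performance estimate for modality m, taken from its unimodal classifier
%      (e.g. the softmax confidence assigned to the correct class).
% alpha: balancing-strength hyperparameter.
\rho_a = \frac{s_a}{s_v}, \qquad
\rho_v = \frac{s_v}{s_a}, \qquad
k_m = 1 - \tanh\bigl(\alpha \,(\rho_m - 1)\bigr)
```

Under such a rule, a dominant modality (ρ_m > 1) receives k_m < 1 and is decelerated, a lagging one (ρ_m < 1) receives k_m > 1 and is accelerated, and k_m tends to 1 as the modalities converge, so the balancing fades out naturally, consistent with the behaviour described above.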
The results reported in the paper show that MLB balances modality contributions more effectively, leading to consistent improvements across diverse multimodal domains.
MLB/
├── agents/
│   └── helpers/       # Evaluator, Loader, Trainer, Validator
├── configs/           # Configuration files for each experiment + default configs
├── datasets/          # Dataset loaders
├── figs/              # Figures & sample outputs
├── models/            # Model architectures
├── posthoc/           # Post-hoc testing & evaluation scripts
├── utils/             # Utility scripts
├── run.sh             # Shell script to launch training/testing
├── train.py           # Training entry point
├── show.py            # Showcasing trained models
├── requirements.txt   # Dependencies
└── README.md          # You are here
Training can be initiated via the command line. For example:
python train.py --config ./configs/CREMA_D/res/MLB.json --default_config ./configs/default_config.json --fold 0 --lr 0.0001 --wd 0.0001 --alpha 1.5

All configurations used for each training run are listed in run.sh.
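The two config flags suggest a layered setup in which the experiment config overrides the defaults and command-line arguments override both. The snippet below is a hypothetical illustration of that pattern only; the keys and merge behaviour are assumptions, not the repository's actual loader:

```python
# Hypothetical illustration of layered configuration (defaults < experiment < CLI).
# The keys and merge behaviour are assumptions, not the repository's actual schema.
import json


def load_config(default_path, experiment_path, cli_overrides):
    """Merge config sources so that later ones take precedence."""
    with open(default_path) as f:
        config = json.load(f)          # project-wide defaults
    with open(experiment_path) as f:
        config.update(json.load(f))    # experiment-specific settings win over defaults
    # Command-line overrides (e.g. --fold, --lr, --wd, --alpha) win over both.
    config.update({k: v for k, v in cli_overrides.items() if v is not None})
    return config


if __name__ == "__main__":
    cfg = load_config("./configs/default_config.json",
                      "./configs/CREMA_D/res/MLB.json",
                      {"fold": 0, "lr": 1e-4, "wd": 1e-4, "alpha": 1.5})
    print(cfg)
```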
Our experiments evaluate MLB across diverse multimodal benchmarks.
| Dataset | Modalities | Task | Link |
|---|---|---|---|
| CREMA-D | Video + Audio | Emotion recognition | CREMA-D |
| AVE | Video + Audio | Action recognition | AVE |
| UCF | Video (+ Audio) | Action recognition | UCF101 |
| CMU-MOSEI | Video + Audio + Text | Sentiment & emotion analysis | MOSEI |
| Something-Something v2 | Video + Optical Flow | Fine-grained action recognition | Sth-Sth |
For feedback, questions, or collaboration opportunities, feel free to reach out at [email protected].
We welcome pull requests, issues, and discussions on this repository.
If you find our work inspiring or use our codebase in your research, please consider giving a star ⭐ and a citation.
@inproceedings{kontras_2024_MLB,
author = {Konstantinos Kontras and Christos Chatzichristos and Matthew B. Blaschko and Maarten De Vos},
title = {Improving Multimodal Learning with Multi-Loss Gradient Modulation},
booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher = {BMVA},
year = {2024},
url = {https://papers.bmvc2024.org/0977.pdf}
}


