Masked Modeling Duo (M2D) & M2D-CLAP

This repository provides demo implementations of our papers "Masked Modeling Duo: Towards a Universal Audio Pre-training Framework," "M2D-CLAP: Exploring General-purpose Audio-Language Representations Beyond CLAP," "Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input," and others.

🌟 Looking for the best general-purpose audio model? M2D-CLAP achieves state-of-the-art performance on audio tagging, zero-shot classification, and audio-language tasks — try it instantly in Colab.

Quick Start

| Description | Notebook |
| --- | --- |
| Audio tagging example (M2D) | examples/Colab_M2D_example_Tagging.ipynb |
| Zero-shot ESC-50 classification with M2D-CLAP | examples/Colab_M2D-CLAP_ESC-50_ZS.ipynb |
| Audio feature visualization example with M2D-CLAP | examples/Colab_M2D-CLAP_ESC-50_VizualizeEmbs.ipynb |

The example below uses M2D-CLAP, our recommended model. You can load it and encode audio in just a few lines:

# 🔊 Load model
from examples.portable_m2d import PortableM2D  # portable_m2d: a simple one-file loader
model = PortableM2D('m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025/checkpoint-30.pth')

# 🎵 Prepare input (three 10-s waveforms, range [-1., 1.])
import torch
batch_audio = 2 * torch.rand((3, 10 * 16000)) - 1.0

# 📐 Encode → frame-level features
frame_level = model(batch_audio)
print(frame_level.shape)  # torch.Size([3, 63, 3840])

# 📦 Aggregate → clip-level features
clip_level = torch.mean(frame_level, dim=1)
print(clip_level.shape)  # torch.Size([3, 3840])
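
For M2D-CLAP checkpoints, the portable loader also provides CLAP embeddings, which enable zero-shot classification. The sketch below is a minimal example, assuming the encode_clap_audio/encode_clap_text methods of examples/portable_m2d.py and hypothetical captions; verify the method names in your copy, and see the zero-shot Colab notebook above for a complete example.

# 🔤 CLAP embeddings for zero-shot use (M2D-CLAP checkpoints only).
# NOTE: encode_clap_audio/encode_clap_text are assumed from examples/portable_m2d.py.
import torch.nn.functional as F

captions = ['dog barking', 'rain falling', 'human speech']  # hypothetical captions
audio_embs = model.encode_clap_audio(batch_audio)  # audio embeddings, one per clip
text_embs = model.encode_clap_text(captions)       # text embeddings, one per caption

# Score each clip against each caption by cosine similarity.
sim = F.cosine_similarity(audio_embs.unsqueeze(1), text_embs.unsqueeze(0), dim=-1)
print(sim.argmax(dim=1))  # index of the best-matching caption per clip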

Pre-trained/Fine-tuned Weights

AudioSet pre-trained weights

| Description | Recommendation | Weight | Fur-PT Ready | AS2M mAP |
| --- | --- | --- | --- | --- |
| M2D-CLAP_2025 | ⭐ Recommended. Best for CLAP / audio tagging (AT) / sound event detection (SED). | m2d_clap_vit_base-80x1001p16x16p16kpBpTI-2025 |  | 0.490 |
| M2D-CLAP_2024, additionally fine-tuned on AS2M | 2nd best for AT/SED. (Encoder only) | m2d_clap_vit_base-80x1001p16x16-240128_AS-FT_enconly | N/A | 0.485 |
| M2D-AS fine-tuned on AS2M | 3rd best for AT/SED. (Encoder only) | m2d_as_vit_base-80x1001p16x16-240213_AS-FT_enconly | N/A | 0.485 |
| M2D/0.7 fine-tuned on AS2M | 4th best for AT/SED. (Encoder only) | m2d_vit_base-80x1001p16x16-221006-mr7_as_46ab246d | N/A | 0.479 |
| M2D/0.7 | General-purpose transfer learning and further pre-training. | m2d_vit_base-80x608p16x16-221006-mr7 | ✓ | - |
| M2D/0.7 | General-purpose transfer learning. (Encoder only) | m2d_vit_base-80x608p16x16-221006-mr7_enconly | N/A | - |
| M2D/0.7 | General-purpose transfer learning. (Encoder only) | m2d_vit_base-80x608p16x16-220930-mr7_enconly | N/A | - |
| M2D/0.7 (t.f. 40 ms) | General-purpose transfer learning and further pre-training w/ finer time frame. | m2d_vit_base-80x200p16x4-230529 | ✓ | - |
| M2D-X/0.7 (η = 0.3) | The best ICBHI 2017 model in Section IV-E of the TASLP paper. | m2d_x_icbhi | N/A | - |
| M2D/0.6 | General-purpose transfer learning and further pre-training. | m2d_vit_base-80x608p16x16-221006-mr6 | ✓ | - |
| M2D-CLAP_2024 (Older) | General-purpose transfer learning and further pre-training, especially when application data is closer to the AudioSet ontology. | m2d_clap_vit_base-80x608p16x16-240128 | ✓ | - |
| M2D-AS | General-purpose transfer learning and further pre-training, especially when application data is closer to the AudioSet ontology. | m2d_as_vit_base-80x608p16x16-240213 | ✓ | - |
| MSM-MAE/0.75 | Predecessor to M2D; for reproducibility or comparison. | msm_mae_vit_base-80x608p16x16-220924-mr75 |  | - |

The following weights operate at a 32 kHz sampling rate:

| Description | Recommendation | Weight | Fur-PT Ready | AS2M mAP |
| --- | --- | --- | --- | --- |
| M2D-AS fine-tuned on AS2M@32kHz | Best for audio tagging (AT) / sound event detection (SED) at 32 kHz. | m2d_as_vit_base-80x1001p16x16p32k-240413_AS-FT_enconly | N/A | 0.480 |
| M2D-AS@32kHz | General-purpose transfer learning at 32 kHz. (Encoder only) | m2d_as_vit_base-80x608p16x16p32k-240413_enconly | N/A | - |

LibriSpeech pre-trained weights

| Description | Recommendation | Weight | Fur-PT Ready | AS2M mAP |
| --- | --- | --- | --- | --- |
| M2D-S/0.6 6-s input | Speech transfer learning and further pre-training. | m2d_s_vit_base-80x608p80x2-230220 | ✓ | - |
| M2D-S/0.6 5-s input | Speech transfer learning and further pre-training. | m2d_s_vit_base-80x512p80x2-230301 | ✓ | - |
| M2D-S/0.6 4-s input | Speech transfer learning and further pre-training. | m2d_s_vit_base-80x400p80x2-230201 | ✓ | - |

Application Resources

👉 The Application Guide (alpha) is available -- our guidelines may provide useful information on how to plan further pre-training of your models.

(Figure: a guide chart)

(Figure: a schematic illustration of M2D-X further pre-training)

1. Setup

The repository is based on the code from facebookresearch/mae, and we apply our changes to these files as a patch.

  1. Download external source files and apply a patch.

    git clone https://github.com/nttcslab/m2d.git
    cd m2d
    curl -o util/lars.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/lars.py
    curl -o util/lr_decay.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/lr_decay.py
    curl -o util/lr_sched.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/lr_sched.py
    curl -o util/misc.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/misc.py
    curl -o util/analyze_repr.py https://raw.githubusercontent.com/daisukelab/general-learning/master/SSL/analyze_repr.py
    curl -o m2d/pos_embed.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/util/pos_embed.py
    curl -o train_audio.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o speech/train_speech.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o audioset/train_as.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o clap/clap_only.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o clap/train_clap.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o mae_train_audio.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/main_pretrain.py
    curl -o m2d/engine_pretrain_m2d.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/engine_pretrain.py
    curl -o m2d/models_mae.py https://raw.githubusercontent.com/facebookresearch/mae/efb2a8062c206524e35e47d04501ed4f544c0ae8/models_mae.py
    curl -o m2d/timm_layers_pos_embed.py https://raw.githubusercontent.com/huggingface/pytorch-image-models/e9373b1b925b2546706d78d25294de596bad4bfe/timm/layers/pos_embed.py
    patch -p1 < patch_m2d.diff
  2. Install the external modules listed in requirements.txt.

    pip install -r requirements.txt
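
After installing, a quick import check can confirm that the patch applied cleanly; the minimal sketch below simply imports two of the patched modules (module paths assumed from the file layout created above) and should run without errors from the repository root.

# Sanity check of the setup: a clean import of the patched modules
# indicates the patch applied and the dependencies are installed.
import m2d.models_mae  # MAE models patched for M2D
import m2d.pos_embed   # positional embedding utilities
print('Setup looks OK.')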

2. Evaluating M2D

We use EVAR for our evaluations.

2-1. Setup EVAR

EVAR is an evaluation package for audio representations, used in our research papers such as BYOL-A.

The following steps set up EVAR.

  1. In the folder of your copy of the M2D repository, clone the EVAR repository and prepare basic items.

    git clone https://github.com/nttcslab/eval-audio-repr.git evar
    cd evar
    curl https://raw.githubusercontent.com/daisukelab/general-learning/master/MLP/torch_mlp_clf2.py -o evar/utils/torch_mlp_clf2.py
    curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/sampler.py -o evar/sampler.py
    curl https://raw.githubusercontent.com/daisukelab/sound-clf-pytorch/master/for_evar/cnn14_decoupled.py -o evar/cnn14_decoupled.py
    cd ..
  2. Set up downstream task datasets according to Preparing-datasets.md. The following is an example of setting up the CREMA-D dataset.

    cd evar
    python evar/utils/download_cremad.py downloads/cremad
    python prepare_wav.py downloads/cremad work/16k/cremad 16000
    cd ..

2-2. Linear Evaluation

Once you set up EVAR, you can evaluate your models as follows.

  • To evaluate a model with an absolute path /your/path/to/model.pth:

    cd evar
    python lineareval.py config/m2d.yaml cremad weight_file=/your/path/to/model.pth
  • If you want to save GPU memory, set a smaller batch size as follows. This example sets it to 16.

    cd evar
    python lineareval.py config/m2d.yaml cremad batch_size=16,weight_file=/your/path/to/model.pth

We used the all_eval.sh script to evaluate on all downstream tasks.
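
If you prefer Python, the following minimal sketch mirrors what all_eval.sh does by looping the linear evaluation over several tasks; run it from the evar folder, and note that the task names other than cremad are assumptions, so check all_eval.sh for the actual list.

# A sketch mirroring all_eval.sh: run linear evaluation over downstream tasks.
# Run from the evar folder; task names other than cremad are hypothetical.
import subprocess

weight = '/your/path/to/model.pth'
for task in ['cremad', 'esc50', 'spcv2']:
    subprocess.run(['python', 'lineareval.py', 'config/m2d.yaml', task,
                    f'weight_file={weight}'], check=True)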

2-3. Fine-tuning

We have fine-tuned our models using the scripts in the util folder.

The following examples fine-tune on each downstream task three times with seed 42. Replace /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 with your actual model path.

cd evar
bash <path/to/m2d>/util/ft-as2m.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # AudioSet 2M
bash <path/to/m2d>/util/ft-as0k.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # AudioSet 20K
bash <path/to/m2d>/util/ft-esc50.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # ESC-50
bash <path/to/m2d>/util/ft-spc.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # Speech Commands
bash <path/to/m2d>/util/ft-vc1.sh /your/path/to/m2d_vit_base-80x608p16x16-221006-mr7 3 42 300  # VoxCeleb1

NOTE: ft-as2m.sh requires the path to your AudioSet samples as log-mel spectrogram .npy files; set your data path in util/ft-as2m.sh before running.

3. Pre-training From Scratch

3-1. Prepare pre-training data samples

The pre-trainer (e.g., train_audio.py for audio) loads data from the data folder by default (--data_path), using a list of samples in a CSV data/files_audioset.csv by default (--csv_main). Follow the steps in data/README.md.

The following is an example using the FSD50K dataset.

  1. Preprocess .wav files into log-mel spectrogram .npy files. The following converts from a source folder /your/local/fsd50k/FSD50K.dev_audio to a new folder data/fsd50k_lms.

    python wav_to_lms.py /your/local/fsd50k/FSD50K.dev_audio data/fsd50k_lms
  2. Create a CSV file that will be used as the list of pre-training samples, containing a single column, file_name. The following example creates files_fsd50k.csv.

    echo file_name > data/files_fsd50k.csv
    (cd data && find fsd50k_lms/FSD50K.dev_audio -name "*.npy") >> data/files_fsd50k.csv

Example of created folder structure:

data/
    files_fsd50k.csv
    fsd50k_lms/
        FSD50K.dev_audio/
            2931.npy
            408195.npy
                :
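
As a Python alternative to the shell commands above, the following sketch builds the same sample list, writing a single file_name column with paths relative to the data folder:

# Build the pre-training sample list in Python instead of echo/find.
# Paths are recorded relative to data/, matching the folder structure above.
from pathlib import Path

data_dir = Path('data')
files = sorted((data_dir / 'fsd50k_lms').rglob('*.npy'))
with (data_dir / 'files_fsd50k.csv').open('w') as f:
    f.write('file_name\n')
    f.writelines(f'{p.relative_to(data_dir).as_posix()}\n' for p in files)
print(f'Listed {len(files)} samples.')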

3-2. Start pre-training

Once your data is ready, start pre-training as follows.

python train_audio.py --csv_main data/files_fsd50k.csv

3-3. Evaluation during and after the training

The training loop automatically evaluates the pre-trained model.

  • During pre-training, train_audio.py runs a script called quick_eval.sh as a sub-process. You can edit quick_eval.sh for your purposes.
  • When the pre-training is finished, the final evaluation script all_eval.sh is executed.

3-4. Complete pre-training command lines

The command lines for pre-training full-performance models follow:

# M2D
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 -m train_audio --input_size 80x608 --patch_size 16x16 --epochs 300 --batch_size 512 --accum_iter 1 --save_freq 50 --seed 3 --model m2d_vit_base --csv_main data/files_audioset.csv --data_path /path/to/your/data --loss_off 0.
# M2D-AS
OMP_NUM_THREADS=1 torchrun --nproc_per_node=4 -m audioset.train_as --input_size 80x608 --patch_size 16x16 --epochs 300 --batch_size 512 --accum_iter 1 --save_freq 50 --seed 3 --data_path /path/to/your/data --loss_off 1.

Note: Replace /path/to/your/data with the path to your LMS data directory. Placing data on fast storage (SSD recommended) significantly speeds up training. If --data_path is omitted, the data/ directory at the repository root is used.

Example logs are available: example_logs.zip.

We explain the details in Guide_app.md.

For other model variants, see also the following section.

4. Other Pre-trained/fine-tuned Weights

Please find all pre-trained/fine-tuned weights on the releases page.

5. License

See LICENSE.pdf for details.

Citations

If you find our M2D or M2D-CLAP useful in your research, please consider citing our papers.

@article{niizumi2025m2d-clap,
    author  = {Niizumi, Daisuke and Takeuchi, Daiki and Yasuda, Masahiro and Nguyen, Binh Thien and Ohishi, Yasunori and Harada, Noboru},
    journal = {IEEE Access},
    title   = {{M2D-CLAP: Exploring General-purpose Audio-Language Representations Beyond CLAP}},
    year    = {2025},
    volume  = {13},
    pages   = {163313--163330},
    doi     = {10.1109/ACCESS.2025.3611348}}

@article{niizumi2024m2dx,
    title   = {{Masked Modeling Duo: Towards a Universal Audio Pre-training Framework}},
    author  = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    journal = {IEEE/ACM Trans. Audio, Speech, Language Process.},
    year    = {2024},
    volume  = {32},
    pages   = {2391--2406},
    url     = {https://ieeexplore.ieee.org/document/10502167},
    doi     = {10.1109/TASLP.2024.3389636}}

@inproceedings{niizumi2024m2d-clap,
    title     = {{M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation}},
    author    = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Masahiro Yasuda and Shunsuke Tsubaki and Keisuke Imoto},
    booktitle = {Interspeech},
    year      = {2024},
    pages     = {57--61},
    doi       = {10.21437/Interspeech.2024-29}}

@inproceedings{niizumi2023m2d,
    title     = {{Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input}},
    author    = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    booktitle = {ICASSP},
    year      = {2023},
    url       = {https://ieeexplore.ieee.org/document/10097236},
    doi       = {10.1109/ICASSP49357.2023.10097236}}

@inproceedings{niizumi2023m2d4speech,
    title     = {{Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation}},
    author    = {Daisuke Niizumi and Daiki Takeuchi and Yasunori Ohishi and Noboru Harada and Kunio Kashino},
    booktitle = {Interspeech},
    year      = {2023},
    pages     = {1294--1298},
    doi       = {10.21437/Interspeech.2023-221}}

@inproceedings{niizumi2024embc,
    title     = {{Exploring Pre-trained General-purpose Audio Representations for Heart Murmur Detection}},
    author    = {Niizumi, Daisuke and Takeuchi, Daiki and Ohishi, Yasunori and Harada, Noboru and Kashino, Kunio},
    booktitle = {EMBC},
    year      = {2024},
    pages     = {1--4},
    doi       = {10.1109/EMBC53108.2024.10782479}}

Acknowledgements

We appreciate the publicly available implementations and all the modules our experiments heavily depend on!
