mtmd: refactor audio preprocessing #17978

ngxson · 2025-12-12T22:58:19Z

The goal of this PR is to allow more audio pre-processing mechanisms to be added to mtmd

While the code is not very clean, this should already allow:

Simplify model : add ASR support for LFM2-Audio-1.5B #17694
Potentially support gemma 3n audio

Key points

Each model's preprocessor now has its own subclass derived from mtmd_audio_preprocessor
A preprocessor can access hparams directly (to read audio params like n_mel, n_fft, etc.)
Each preprocessor also has its own initialize() function, which is called on model load to initialize global cache entries such as the sin/cos tables and the Hann window
The filter bank is now constructed dynamically thanks to @tdakhran's implementation of fill_mel_filterbank_matrix (the hard-coded values are now removed)
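
A rough sketch of how a mel filter bank can be built at run time instead of being hard-coded. The function name make_mel_filterbank, the HTK mel formula, and the unnormalized triangular filters are assumptions for illustration; this is not the fill_mel_filterbank_matrix from the PR, which may use a different mel scale and normalization.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// HTK-style mel scale conversions (an assumption, see the note above).
static double hz_to_mel(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }
static double mel_to_hz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

// Hypothetical helper: returns a row-major [n_mel x n_bins] matrix of
// triangular filter weights, where n_bins = n_fft / 2 + 1.
std::vector<float> make_mel_filterbank(int n_mel, int n_fft, int sample_rate) {
    assert(n_mel > 0 && n_fft > 0 && sample_rate > 0);
    const int n_bins = n_fft / 2 + 1;

    // n_mel + 2 edge frequencies, equally spaced on the mel scale from 0 Hz to Nyquist.
    std::vector<double> edges(n_mel + 2);
    const double mel_max = hz_to_mel(sample_rate / 2.0);
    for (int i = 0; i < n_mel + 2; ++i) {
        edges[i] = mel_to_hz(mel_max * i / (n_mel + 1));
    }

    std::vector<float> fb((size_t) n_mel * n_bins, 0.0f);
    for (int m = 0; m < n_mel; ++m) {
        const double lo = edges[m], mid = edges[m + 1], hi = edges[m + 2];
        for (int k = 0; k < n_bins; ++k) {
            const double f    = (double) k * sample_rate / n_fft; // frequency of FFT bin k
            const double up   = (f - lo) / (mid - lo);            // rising slope
            const double down = (hi - f) / (hi - mid);            // falling slope
            fb[(size_t) m * n_bins + k] = (float) std::fmax(0.0, std::fmin(up, down));
        }
    }
    return fb;
}
```

For n_mel = 80, n_fft = 400 and a 16 kHz sample rate this produces an 80 x 201 matrix, so changing any of these hyperparameters no longer requires a new pre-computed table.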

ngxson · 2025-12-12T23:23:53Z

Hmm, I think I can also upstream some changes from #17694, which would make your PR a bit shorter @tdakhran

I will remove the pre-calculated filters and replace them with your version

Edit: since my goal is to implement conformer, I think I will end up copying a lot of code and refactoring it along the way

Co-authored-by: Tarek <[email protected]>

ngxson · 2025-12-13T14:18:02Z

@ggerganov This is ready for review. I only have basic knowledge of signal/audio processing, so I would appreciate it if you could take a deeper look to check that things are still correct compared to the original code from whisper.cpp

Note: this change also contains enough code for the LFM2-audio and gemma 3n audio preprocessors

Test results:

[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M

ggerganov

I didn't spot anything suspicious. Though ultimately we have to do some actual tests to be sure that Whisper still works.

I assume @tdakhran did the necessary verification, so I think it's OK to proceed.

tools/mtmd/mtmd-audio.h

tdakhran

@ngxson, thanks for the refactor.

@ggerganov I've verified that the generated coefficients match the existing ones for n_fft = 400.

tools/mtmd/mtmd-audio.cpp

tdakhran · 2025-12-15T10:00:02Z

tools/mtmd/mtmd-audio.cpp

+    params.sample_rate      = hparams.audio_sample_rate;
+    params.center_padding   = false;
+    params.preemph          = 0.0f; // disabled
+    params.use_natural_log  = false;


params.use_natural_log = true; for LFM2-Audio-1.5B. I'd like to avoid reimplementing the whole processor just because of it. Should all params members be defined in hparams?

ngxson · 2025-12-15

For other models, it's recommended to make a dedicated class that extends mtmd_audio_preprocessor:

struct mtmd_audio_preprocessor_lfm2a : mtmd_audio_preprocessor {
    mtmd_audio_preprocessor_lfm2a(const clip_ctx * ctx) : mtmd_audio_preprocessor(ctx) {}
    void initialize() override;
    bool preprocess(const float * samples, size_t n_samples, std::vector<mtmd_audio_mel> & output) override;
};

This way, you can customize the initialization of the cache, while also defining custom filter params and handling custom padding

tdakhran · 2025-12-15

Sounds good, will create a dedicated class.
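
The dedicated-class approach from this thread can be sketched in a self-contained form. Everything named demo_* below is a hypothetical stand-in: the real mtmd_audio_preprocessor is constructed from a clip_ctx and writes mtmd_audio_mel output, neither of which is reproduced here. The periodic Hann window and the single-frame log energy are also simplifying assumptions, chosen only to show where a cache and a model-specific use_natural_log setting would live.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for the filter params seen in the diff (only two fields kept).
struct demo_filter_params {
    bool  use_natural_log = false;
    float preemph         = 0.0f; // 0 means disabled
};

// Stand-in base class mirroring the initialize()/preprocess() split.
struct demo_preprocessor {
    virtual ~demo_preprocessor() = default;
    virtual void initialize() = 0; // meant to be called once at model load
    virtual bool preprocess(const float * samples, size_t n_samples, std::vector<float> & output) = 0;
};

// A model-specific subclass that wants natural-log compression.
struct demo_preprocessor_natural_log : demo_preprocessor {
    demo_filter_params params;
    std::vector<float> hann; // cache filled in initialize()
    int                n_fft;

    explicit demo_preprocessor_natural_log(int n_fft_in) : n_fft(n_fft_in) {}

    void initialize() override {
        params.use_natural_log = true; // the model-specific difference
        const double pi = std::acos(-1.0);
        hann.resize(n_fft);
        for (int i = 0; i < n_fft; ++i) {
            // periodic Hann window (assumed variant)
            hann[i] = (float) (0.5 * (1.0 - std::cos(2.0 * pi * i / n_fft)));
        }
    }

    // Toy preprocessing: log energy of the first windowed frame only.
    bool preprocess(const float * samples, size_t n_samples, std::vector<float> & output) override {
        if (hann.empty() || n_samples < (size_t) n_fft) {
            return false; // not initialized, or input shorter than one frame
        }
        double energy = 0.0;
        for (int i = 0; i < n_fft; ++i) {
            const double v = (double) samples[i] * hann[i];
            energy += v * v;
        }
        energy = std::fmax(energy, 1e-10); // avoid log(0)
        output.push_back((float) (params.use_natural_log ? std::log(energy) : std::log10(energy)));
        return true;
    }
};
```

Keeping the log choice inside the subclass means the shared code path does not need an extra hparams field for every model-specific switch, at the cost of some duplicated preprocessing code per model.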

tdakhran · 2025-12-15T10:53:57Z

@ngxson for reference, this is how the rebased LFM2 Audio ASR PR (ngxson#58) looks

mtmd: refactor audio preprocessing

7cc2cf1

ngxson requested a review from ggerganov December 12, 2025 22:58

ngxson marked this pull request as draft December 12, 2025 23:24

github-actions bot added the examples label Dec 13, 2025

ngxson and others added 5 commits December 13, 2025 13:28

refactor

9e2cd84

Co-authored-by: Tarek <[email protected]>

wip

cea1a90

wip (2)

7b578b5

improve constructor

4b63e61

fix use_natural_log

93290e5

ngxson marked this pull request as ready for review December 13, 2025 14:06

fix padding for short input

1aaec3b

tdakhran mentioned this pull request Dec 14, 2025

model : add ASR support for LFM2-Audio-1.5B #17694

Closed

ggerganov approved these changes Dec 15, 2025

tdakhran reviewed Dec 15, 2025

clean up

7b417fd

ngxson mentioned this pull request Dec 15, 2025

Rebased ASR for LFM2-Audio-1.5B ngxson/llama.cpp#58

Closed

remove need_chunking

8ace9e6

ngxson merged commit 96a181a into ggml-org:master Dec 15, 2025
65 of 69 checks passed
