@ngxson
Collaborator

@ngxson ngxson commented Dec 12, 2025

The goal of this PR is to allow more audio pre-processing mechanisms to be added to mtmd.

While the code is not very clean, this should already allow:


Key points

  • Each model's preprocessor now has its own subclass that extends mtmd_audio_preprocessor
  • A preprocessor can access hparams directly (to read audio params like n_mel, n_fft, etc.)
  • Each preprocessor also has its own initialize() function, which is called on model load to initialize global cache entries such as the sin/cos tables and the Hann window
  • The filter bank is now constructed dynamically thanks to @tdakhran's implementation of fill_mel_filterbank_matrix (the hard-coded values are now removed)
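
To illustrate the kind of cache that initialize() is described as filling, here is a minimal sketch. The struct and method names (audio_cache_sketch, fill_sin_cos_table, fill_hann_window) are hypothetical stand-ins, not the code in this PR, and the periodic Hann formula is an assumption about which window variant is wanted.

```cpp
#include <cmath>
#include <vector>

// Hypothetical cache, standing in for the per-preprocessor state described above.
struct audio_cache_sketch {
    std::vector<float> sin_vals;
    std::vector<float> cos_vals;
    std::vector<float> hann_window;

    // Precompute sin/cos of 2*pi*i/n once, so a DFT loop can index the table
    // instead of calling sin/cos for every sample.
    void fill_sin_cos_table(int n) {
        sin_vals.resize(n);
        cos_vals.resize(n);
        const double two_pi = 2.0 * 3.14159265358979323846;
        for (int i = 0; i < n; i++) {
            const double theta = two_pi * i / n;
            sin_vals[i] = (float) std::sin(theta);
            cos_vals[i] = (float) std::cos(theta);
        }
    }

    // Periodic Hann window (assumed variant): w[i] = 0.5 * (1 - cos(2*pi*i / length)).
    void fill_hann_window(int length) {
        hann_window.resize(length);
        const double two_pi = 2.0 * 3.14159265358979323846;
        for (int i = 0; i < length; i++) {
            hann_window[i] = (float) (0.5 * (1.0 - std::cos(two_pi * i / length)));
        }
    }
};
```

Both tables depend only on the frame length, which is why computing them once at model load and reusing them for every audio chunk is attractive.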

@ngxson ngxson requested a review from ggerganov December 12, 2025 22:58
@ngxson
Collaborator Author

ngxson commented Dec 12, 2025

Hmm, I think I can also upstream some changes from #17694 , that would make your PR a bit shorter @tdakhran

I will remove the pre-calculated filters and replace them with your version.

Edit: since my goal is to implement conformer, I think I will end up copying a lot of code and refactoring it along the way.

@ngxson ngxson marked this pull request as draft December 12, 2025 23:24
@ngxson ngxson marked this pull request as ready for review December 13, 2025 14:06
@ngxson
Collaborator Author

ngxson commented Dec 13, 2025

@ggerganov This is ready for review. I only have basic knowledge of signal/audio processing, so I would appreciate it if you could take a deeper look to check that things are still correct compared to the original code from whisper.cpp.

Note: this change also contains enough code for the LFM2-audio and gemma 3n audio preprocessors.

Test results:

[audio]  OK:   ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M

Member

@ggerganov ggerganov left a comment

I didn't spot anything suspicious, though ultimately we have to run some actual tests to be sure that Whisper still works.

I assume @tdakhran did the necessary verification, so I think it's OK to proceed.

Contributor

@tdakhran tdakhran left a comment

@ngxson, thanks for the refactor.

@ggerganov I've verified that the generated coefficients match the existing ones for n_fft = 400.
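
For readers unfamiliar with what "generated coefficients" means here, the following is a rough sketch of building a triangular mel filter bank at runtime. It is not fill_mel_filterbank_matrix: the function name build_mel_filterbank_sketch is hypothetical, it uses the HTK mel formula for brevity, and it applies no normalization, so its values are not expected to match the coefficients verified above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// HTK-style mel conversions (an assumption; other mel scales exist).
static double hz_to_mel(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }
static double mel_to_hz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

// Returns n_mel rows of (n_fft/2 + 1) triangular weights, stored row-major.
static std::vector<float> build_mel_filterbank_sketch(int n_mel, int n_fft, int sample_rate) {
    const int    n_bins  = n_fft / 2 + 1;
    const double mel_max = hz_to_mel(sample_rate / 2.0);

    // n_mel + 2 edge frequencies, equally spaced on the mel scale from 0 to Nyquist.
    std::vector<double> edges(n_mel + 2);
    for (int i = 0; i < n_mel + 2; i++) {
        edges[i] = mel_to_hz(mel_max * i / (n_mel + 1));
    }

    std::vector<float> fb((size_t) n_mel * n_bins, 0.0f);
    for (int m = 0; m < n_mel; m++) {
        const double lo = edges[m], mid = edges[m + 1], hi = edges[m + 2];
        for (int k = 0; k < n_bins; k++) {
            const double f    = (double) k * sample_rate / n_fft; // bin center in Hz
            const double up   = (f - lo) / (mid - lo);            // rising slope
            const double down = (hi - f) / (hi - mid);            // falling slope
            const double w    = std::min(up, down);
            if (w > 0.0) {
                fb[(size_t) m * n_bins + k] = (float) w;
            }
        }
    }
    return fb;
}
```

With n_fft = 400 this yields 201 frequency bins per filter; comparing such a generated matrix element by element against the previously hard-coded table is the kind of check described in the comment above.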

params.sample_rate = hparams.audio_sample_rate;
params.center_padding = false;
params.preemph = 0.0f; // disabled
params.use_natural_log = false;
Contributor

params.use_natural_log = true; is needed for LFM2-Audio-1.5B, and I'd like to avoid reimplementing the whole processor just because of it. Should all params members be defined in hparams?
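
As a small illustration of what a flag like use_natural_log could toggle, here is a sketch of the log step applied to mel energies. The function name apply_log_sketch and the 1e-10 clamp are assumptions for the example, not taken from this PR.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical helper: converts mel energies to log scale in place.
// Values are clamped first so that log(0) is never evaluated.
static void apply_log_sketch(std::vector<float> & mel, bool use_natural_log) {
    for (float & v : mel) {
        const float clamped = std::max(v, 1e-10f);
        v = use_natural_log ? std::log(clamped) : std::log10(clamped);
    }
}
```

The two variants differ only by a constant factor of ln(10), but a model trained on one scale generally expects the same scale at inference time, which is why the choice has to be configurable per model.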

Collaborator Author

@ngxson ngxson Dec 15, 2025

For other models, it's recommended to make a dedicated class that extends from mtmd_audio_preprocessor:

struct mtmd_audio_preprocessor_lfm2a : mtmd_audio_preprocessor {
    mtmd_audio_preprocessor_lfm2a(const clip_ctx * ctx) : mtmd_audio_preprocessor(ctx) {}
    void initialize() override;
    bool preprocess(const float * samples, size_t n_samples, std::vector<mtmd_audio_mel> & output) override;
};

This way, you can customize the initialization of the cache, while also defining custom filter params and handling custom padding.
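
The pattern can be sketched in a self-contained way as follows. All type names here (mel_sketch, preprocessor_base_sketch, preprocessor_custom_sketch) are hypothetical stand-ins for mtmd_audio_mel and mtmd_audio_preprocessor, and the body of preprocess() is a placeholder that only demonstrates custom padding, not a real mel computation.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the mel output type.
struct mel_sketch {
    int n_mel = 0;
    int n_len = 0;
    std::vector<float> data;
};

// Hypothetical stand-in for the base class: one object per loaded model.
struct preprocessor_base_sketch {
    virtual ~preprocessor_base_sketch() = default;
    virtual void initialize() = 0; // called once on model load
    virtual bool preprocess(const float * samples, size_t n_samples,
                            std::vector<mel_sketch> & output) = 0;
};

// A model-specific subclass that owns its own cache and padding rules.
struct preprocessor_custom_sketch : preprocessor_base_sketch {
    int  frame_size = 0;
    bool ready      = false;

    void initialize() override {
        frame_size = 400; // assumed value; a real class would read it from hparams
        ready      = true;
    }

    bool preprocess(const float * samples, size_t n_samples,
                    std::vector<mel_sketch> & output) override {
        if (!ready || samples == nullptr || n_samples == 0) {
            return false;
        }
        // Custom padding: round up to a whole number of frames.
        mel_sketch mel;
        mel.n_mel = 1;
        mel.n_len = (int) ((n_samples + frame_size - 1) / frame_size);
        mel.data.assign((size_t) mel.n_mel * mel.n_len, 0.0f); // placeholder values
        output.push_back(std::move(mel));
        return true;
    }
};
```

Keeping model-specific choices inside a subclass means the shared code path does not need a growing list of flags for every new model.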

Contributor

sounds good, will create a dedicated class.

@tdakhran
Contributor

tdakhran commented Dec 15, 2025

@ngxson for reference, this is how the rebased LFM2 Audio ASR PR ngxson#58 looks

@ngxson ngxson merged commit 96a181a into ggml-org:master Dec 15, 2025
65 of 69 checks passed

3 participants