mtmd: refactor audio preprocessing #17978
Hmm, I think I can also upstream some changes from #17694, which would make your PR a bit shorter. @tdakhran I will remove the pre-calculated filters and replace them with your version.
Edit: since my goal is to implement conformer, I think I will end up copying a lot of code and refactoring it along the way.
Co-authored-by: Tarek <[email protected]>
@ggerganov This is ready for review. I only have basic knowledge of signal/audio processing, so I would appreciate it if you could take a deeper look to check that things are still correct compared to the original code from whisper.cpp.
Note: this change also contains enough code for the LFM2-audio and gemma 3n audio preprocessors.
Test results:
ggerganov
left a comment
I didn't spot anything suspicious. Ultimately, though, we have to run some actual tests to be sure that Whisper still works.
I assume @tdakhran did the necessary verification, so I think it's OK to proceed.
tdakhran
left a comment
@ngxson, thanks for the refactor.
@ggerganov I've verified that generated coefficients match existing for n_fft = 400.
    params.sample_rate = hparams.audio_sample_rate;
    params.center_padding = false;
    params.preemph = 0.0f; // disabled
    params.use_natural_log = false;
LFM2-Audio-1.5B needs params.use_natural_log = true; and I'd like to avoid reimplementing the whole processor just because of it. Should all params members be defined in hparams?
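For context, the difference the flag makes can be sketched as follows. This is a minimal illustration, not the code from the PR: `apply_log_scale` and its `eps` floor are hypothetical, and only the flag name `use_natural_log` comes from the snippet above.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not part of mtmd): converts mel energies to log scale in place.
// The use_natural_log flag selects ln(x) instead of log10(x).
inline void apply_log_scale(std::vector<float> & mel, bool use_natural_log, float eps = 1e-10f) {
    for (float & v : mel) {
        // Clamp to a small floor so that zero energy does not produce -inf.
        const float clamped = v > eps ? v : eps;
        v = use_natural_log ? std::log(clamped) : std::log10(clamped);
    }
}
```

With this shape, a model that expects natural-log features only differs in one argument; any further normalization a given model applies after the log is outside this sketch.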
For other models, it's recommended to make a dedicated class that extends from mtmd_audio_preprocessor:
struct mtmd_audio_preprocessor_lfm2a : mtmd_audio_preprocessor {
    mtmd_audio_preprocessor_lfm2a(const clip_ctx * ctx) : mtmd_audio_preprocessor(ctx) {}
    void initialize() override;
    bool preprocess(const float * samples, size_t n_samples, std::vector<mtmd_audio_mel> & output) override;
};
This way, you can customize cache initialization, while also defining custom filter params and handling custom padding.
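A self-contained sketch of how such a dedicated class might look end to end is shown here. Everything inside `namespace sketch` is a simplified stand-in: the fields of `clip_ctx` and `mtmd_audio_mel`, the pure-virtual base, and the placeholder "log energy" feature are assumptions for illustration, not the real mtmd definitions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {

// Simplified stand-ins for the real types; their fields are assumptions.
struct clip_ctx { int audio_sample_rate = 16000; };
struct mtmd_audio_mel { int n_len = 0; int n_mel = 0; std::vector<float> data; };

// Assumed shape of the base class: one-time setup plus per-input preprocessing.
struct mtmd_audio_preprocessor {
    const clip_ctx * ctx;
    explicit mtmd_audio_preprocessor(const clip_ctx * ctx) : ctx(ctx) {}
    virtual ~mtmd_audio_preprocessor() = default;
    virtual void initialize() = 0;
    virtual bool preprocess(const float * samples, size_t n_samples,
                            std::vector<mtmd_audio_mel> & output) = 0;
};

struct mtmd_audio_preprocessor_lfm2a : mtmd_audio_preprocessor {
    bool use_natural_log = false;

    explicit mtmd_audio_preprocessor_lfm2a(const clip_ctx * ctx) : mtmd_audio_preprocessor(ctx) {}

    // Model-specific settings live here instead of in shared hparams.
    void initialize() override { use_natural_log = true; }

    bool preprocess(const float * samples, size_t n_samples,
                    std::vector<mtmd_audio_mel> & output) override {
        if (samples == nullptr || n_samples == 0) {
            return false;
        }
        // Placeholder for real feature extraction: a single 1x1 "frame"
        // holding the log of the total signal energy.
        float energy = 0.0f;
        for (size_t i = 0; i < n_samples; ++i) {
            energy += samples[i] * samples[i];
        }
        energy = energy > 1e-10f ? energy : 1e-10f;
        mtmd_audio_mel mel;
        mel.n_len = 1;
        mel.n_mel = 1;
        mel.data.push_back(use_natural_log ? std::log(energy) : std::log10(energy));
        output.push_back(mel);
        return true;
    }
};

} // namespace sketch
```

The design point is that per-model differences such as the log base or padding stay inside the subclass, so the shared parameter struct does not have to grow a field for every model.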
sounds good, will create a dedicated class.
The goal of this PR is to allow more audio pre-processing mechanisms to be added to mtmd.
While the code is not very clean, this should already allow:
Key points
mtmd_audio_preprocessor
initialize() function, which will be called on model load to initialize global cache entries such as sin/cos and the Hann window
fill_mel_filterbank_matrix (the hard-coded value is now removed)
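The one-time cache setup that initialize() is described as doing could look roughly like the sketch below. The struct name `audio_cache`, its members, and the choice of a periodic Hann window are assumptions made for illustration; the PR's actual cache layout is not shown in this thread.

```cpp
#include <cmath>
#include <vector>

// Hypothetical cache, filled once at model load and reused for every audio input.
struct audio_cache {
    std::vector<float> sin_vals;    // sin(2*pi*i/n_fft), i = 0..n_fft-1
    std::vector<float> cos_vals;    // cos(2*pi*i/n_fft), i = 0..n_fft-1
    std::vector<float> hann_window; // periodic Hann window of length n_fft

    void initialize(int n_fft) {
        const double two_pi = 2.0 * 3.14159265358979323846;
        sin_vals.resize(n_fft);
        cos_vals.resize(n_fft);
        hann_window.resize(n_fft);
        for (int i = 0; i < n_fft; ++i) {
            const double theta = two_pi * i / n_fft;
            sin_vals[i] = static_cast<float>(std::sin(theta));
            cos_vals[i] = static_cast<float>(std::cos(theta));
            // Periodic Hann: 0.5 * (1 - cos(2*pi*i/N)); starts at 0, peaks at N/2.
            hann_window[i] = static_cast<float>(0.5 * (1.0 - std::cos(theta)));
        }
    }
};
```

Calling `initialize(400)` once would fill all three tables for n_fft = 400, the size mentioned in the review above, so that per-frame processing only performs table lookups.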