model : add ASR support for LFM2-Audio-1.5B #17694
Conversation
tested with the following message list:

```python
[
    {"role": "system", "content": "Perform ASR."},
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "format": "wav",
                    "data": base64.b64encode(
                        pathlib.Path("/data/playground/issue_400/10.wav").read_bytes()
                    ).decode("utf-8"),
                },
            },
        ],
    },
]
```
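For reference, building such a payload can be sketched as a small helper. The system prompt and the `input_audio` field names follow the tested request above; the helper name itself is made up for this sketch.

```python
import base64

def build_asr_messages(wav_bytes: bytes) -> list:
    """Wrap raw wav bytes into the OpenAI-style message list used for ASR.

    Hypothetical helper: only the message structure mirrors the tested
    request; the function name and signature are assumptions.
    """
    return [
        {"role": "system", "content": "Perform ASR."},
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "format": "wav",
                        # base64-encode the raw file contents, as in the test
                        "data": base64.b64encode(wav_bytes).decode("utf-8"),
                    },
                },
            ],
        },
    ]
```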
The code is tested. I will wait for #17978 to be merged, then rebase and mark this PR as "ready for review".
force-pushed from 50597aa to 5044ab6
The code is ready for review and is tested; it produces valid results for the attached file.
tools/mtmd/models/lfm2-audio-enc.cpp
Outdated
```cpp
Kcur = ggml_cont(ctx0, ggml_permute(ctx0, Kcur, 0, 2, 1, 3));
Q_bias_u = ggml_cont(ctx0, ggml_permute(ctx0, Q_bias_u, 0, 2, 1, 3));
ggml_tensor * matrix_ac = ggml_mul_mat(ctx0, Q_bias_u, Kcur);
matrix_ac = ggml_cont(ctx0, ggml_permute(ctx0, matrix_ac, 1, 0, 2, 3));
cb(matrix_ac, "conformer.layers.{}.self_attn.id3", il);

auto * p = ggml_mul_mat(ctx0, layer.linear_pos_w, pos_emb);
cb(p, "conformer.layers.{}.self_attn.linear_pos", il);
p = ggml_reshape_3d(ctx0, p, d_head, n_head, p->ne[1]);

Q_bias_v = ggml_cont(ctx0, ggml_permute(ctx0, Q_bias_v, 0, 2, 1, 3));
cb(Q_bias_v, "conformer.layers.{}.self_attn.id0", il);
p = ggml_cont(ctx0, ggml_permute(ctx0, p, 1, 2, 0, 3));
```
do you think we could replace this with build_attn?
The advantage of build_attn is that it supports flash attention, which can significantly improve performance. I'm not sure if anything is currently missing to make it work in this case, though.
I saw some extra stuff like biases, matrix_ac, and matrix_bd; it scared me, so I followed the Python implementation as is. I will give it a second look.
Looked into it: build_attn won't fit, there are too many customizations to the attention.
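For context, the snippet above computes Transformer-XL style relative-position attention scores as used in Conformer self-attention: a content term `(Q + bias_u) @ K^T` (matrix_ac) and a position term `(Q + bias_v) @ P^T` (matrix_bd). A numpy sketch under assumed shapes — the names and layout here are illustrative, not the ggml API:

```python
import numpy as np

def rel_attn_scores(q, k, p, bias_u, bias_v):
    # q, k: (n_head, q_len, d_head); p: (n_head, pos_len, d_head)
    # bias_u, bias_v: (n_head, 1, d_head), broadcast over the query axis
    matrix_ac = (q + bias_u) @ k.transpose(0, 2, 1)  # content term: (n_head, q_len, q_len)
    matrix_bd = (q + bias_v) @ p.transpose(0, 2, 1)  # position term: (n_head, q_len, pos_len)
    return matrix_ac, matrix_bd

rng = np.random.default_rng(0)
h, t, d = 2, 4, 8
q = rng.normal(size=(h, t, d))
k = rng.normal(size=(h, t, d))
p = rng.normal(size=(h, 2 * t - 1, d))  # relative positions -(t-1)..(t-1)
u = rng.normal(size=(h, 1, d))
v = rng.normal(size=(h, 1, d))
ac, bd = rel_attn_scores(q, k, p, u, v)
```

The final attention matrix is `softmax((matrix_ac + rel_shift(matrix_bd)) / sqrt(d_head))`; the extra position term and per-head biases are exactly the customizations that make a plain build_attn call a poor fit.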
tools/mtmd/models/lfm2-audio-enc.cpp
Outdated
```cpp
    matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, q_len, pos_len + 1, h);
    matrix_bd = ggml_cont(ctx0, ggml_view_3d(ctx0, matrix_bd,
        q_len, pos_len, h,
        matrix_bd->nb[1], matrix_bd->nb[2], matrix_bd->nb[0] * q_len));
    matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, pos_len, q_len, h);
}

matrix_bd = ggml_cont(ctx0, ggml_view_3d(ctx0, matrix_bd,
```
a bit strange that we have these 4 reshapes/views without any permutations — can we collapse this into a single ggml_reshape_3d?
If it were a plain view, the reshapes could be simplified, but there is a crop happening inside the ggml_view_3d.
hmm yeah interesting. not very important to optimize this, so I'll have a look later to see if there is another way
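What the reshape/view/crop sequence implements is the standard Transformer-XL "relative shift": misalign the rows via a reshape so that dropping one row/column realigns position j with relative offset j − i. A pure-numpy sketch (this mirrors the trick only loosely — the exact axis order and crop offsets in the ggml code differ):

```python
import numpy as np

def rel_shift(x):
    # x: (h, q_len, pos_len) position-term scores
    h, q_len, pos_len = x.shape
    x = np.pad(x, ((0, 0), (0, 0), (1, 0)))  # prepend a zero column
    x = x.reshape(h, pos_len + 1, q_len)     # deliberately misalign the rows
    x = x[:, 1:, :]                          # drop the first row (the "crop")
    return x.reshape(h, q_len, pos_len)      # realign: row i is now shifted by i
```

Because the crop changes which elements survive, the sequence is not a pure reshape, which is why the four ops don't collapse into one `ggml_reshape_3d`.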
tools/mtmd/models/lfm2-audio-enc.cpp
Outdated
```cpp
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
x = ggml_add(ctx0, ggml_mul(ctx0, x, layer.conv_norm_w), layer.conv_norm_b);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
```
we may be able to remove the transposes if conv_norm_w / conv_norm_b are already transposed upon conversion?
tools/mtmd/models/lfm2-audio-enc.cpp
Outdated
```cpp
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
auto * conv_pw2_w = ggml_reshape_2d(ctx0, layer.conv_pw2_w, layer.conv_pw2_w->ne[1], layer.conv_pw2_w->ne[2]);
x = ggml_mul_mat(ctx0, conv_pw2_w, x);
x = ggml_add(ctx0, x, layer.conv_pw2_b);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
```
(I'll have a look into this.) I suspect that these 2 transposes can be removed too (or at worst, one can be a view).
Many transposes here are following the Python code without optimization in mind. The objective was to get numerically close intermediates. I'll have a closer look to understand what can be optimized.
Removed most of the transposes.
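The reason most of these transpose pairs are removable: a per-channel affine (like the conv layer norm above) written as transpose → scale/shift → transpose back is identical to broadcasting the scale/shift along the channel axis directly. A pure-numpy illustration — names and shapes are placeholders, not the ggml API:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))  # (time, channels)
w = rng.normal(size=(8,))    # per-channel scale
b = rng.normal(size=(8,))    # per-channel shift

# literal translation of the Python model: transpose, affine, transpose back
with_transposes = ((x.T * w[:, None]) + b[:, None]).T

# equivalent form with no transposes: broadcast along the channel axis
direct = x * w + b

assert np.allclose(with_transposes, direct)
```

The same cancellation applies around matmuls, since `(W @ x.T).T == x @ W.T`, which is why following the Python code line by line produced numerically correct but transpose-heavy graphs.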
tools/mtmd/models/lfm2-audio-enc.cpp
Outdated
```cpp
cur = ggml_mul_mat(ctx0, model.mm_1_w, cur);
cur = ggml_add(ctx0, cur, model.mm_1_b);
cb(cur, "audio_adapter.model.{}", 1);
cur = ggml_gelu_erf(ctx0, cur);
cb(cur, "audio_adapter.model.{}", 2);
cur = ggml_mul_mat(ctx0, model.mm_3_w, cur);
cur = ggml_add(ctx0, cur, model.mm_3_b);
cb(cur, "audio_adapter.model.{}", 3);
```
this can be replaced with build_ffn
I didn't recognize it as such; will replace.
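The adapter block above is indeed a plain two-layer MLP — linear → GELU (erf variant, matching `ggml_gelu_erf`) → linear — which is the pattern build_ffn expresses. A numpy sketch under assumed shapes; the weights are random placeholders:

```python
import math
import numpy as np

def gelu_erf(x):
    # "exact" GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def adapter_ffn(x, w1, b1, w3, b3):
    # mm_1 -> gelu_erf -> mm_3, mirroring the ggml snippet above
    return gelu_erf(x @ w1.T + b1) @ w3.T + b3

rng = np.random.default_rng(2)
d_in, d_mid, d_out = 6, 12, 4
x = rng.normal(size=(3, d_in))
w1 = rng.normal(size=(d_mid, d_in)); b1 = rng.normal(size=(d_mid,))
w3 = rng.normal(size=(d_out, d_mid)); b3 = rng.normal(size=(d_out,))
y = adapter_ffn(x, w1, b1, w3, b3)
```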
force-pushed from 8ba4562 to ba9e597
Rebased to incorporate #18061; now works as is.
Thanks @tdakhran! I'll do a final review tomorrow and will push commits directly here if needed. For now, my priority is to make sure that the GGUF is ready for any possible future optimizations. We can then look deeper into these optimizations in a follow-up PR (so users won't have to re-generate the GGUF).
Never mind, I can do a follow-up PR.
Superseded by #18106
|
@ngxson , my bad, I think I forgot to click "allow edits" when created PR |

LFM2-Audio-1.5B supports audio input and audio output. This PR adds only ASR support. To perform ASR, invoke the CLI with:

Changes to existing code:

- `-sys` enabled for `llama-mtmd-cli`
- `n_fft` values
- `OP_SSM_CONV` for the CUDA backend is extended to support kernel size 9

cc: @ngxson
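On the last point: conceptually, `OP_SSM_CONV` performs a causal depthwise 1-D convolution per channel, and the CUDA kernel previously handled a smaller set of kernel widths than the width 9 needed here. A heavily hedged pure-Python sketch of the per-channel computation (this is an illustration of the operation, not the ggml implementation):

```python
def ssm_conv_1d(x, w):
    """Causal depthwise 1-D convolution for one channel.

    x: list of T input samples; w: kernel of width K (e.g. K = 9 here).
    Left-padding with zeros keeps the output causal and the same length as x.
    """
    K = len(w)
    padded = [0.0] * (K - 1) + list(x)
    return [sum(w[j] * padded[t + j] for j in range(K)) for t in range(len(x))]
```

Extending the CUDA kernel to width 9 means this same per-channel loop (one output per timestep, K multiply-adds each) must be supported for K = 9 rather than only the previously hard-coded widths.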