
Conversation

@tdakhran
Contributor

@tdakhran tdakhran commented Dec 2, 2025

LFM2-Audio-1.5B supports audio input and audio output.

This PR adds only ASR support. To perform ASR, invoke the CLI with:

bin/llama-mtmd-cli -m LFM2-Audio-1.5B-F32.gguf --mmproj mmproj-LFM2-Audio-1.5b-F32.gguf -n 30 --audio input.wav -sys "Perform ASR." -p "<__media__>"

Changes to existing code:

  • the model requires a system prompt; -sys is now enabled for llama-mtmd-cli
  • mel bin generation is reworked: the filter bank is now generated dynamically and supports different n_fft values (see the sketch after this list)
  • OP_SSM_CONV in the CUDA backend is extended to support kernel size 9
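
For illustration, a minimal sketch of generating a mel filter bank dynamically for an arbitrary n_fft (a generic HTK-style triangular-filter construction; an assumption for illustration, not the PR's actual implementation):

import numpy as np

def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int) -> np.ndarray:
    # HTK-style mel scale conversions
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # n_mels + 2 band edges, equally spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sample_rate).astype(int)

    # one triangular filter per mel bin over the n_fft // 2 + 1 FFT bins
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)    # rising edge
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)  # falling edge
    return fb

Because the filters are computed from n_fft and the sample rate at load time, no precomputed table has to be shipped for each supported n_fft.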

cc: @ngxson

@tdakhran
Contributor Author

tdakhran commented Dec 2, 2025

Tested that llama-server works as intended with the following input:

[
    {"role": "system", "content": "Perform ASR."},
    {
        "role": "user",
        "content": [
            {
                "type": "input_audio",
                "input_audio": {
                    "format": "wav",
                    "data": base64.b64encode(pathlib.Path("/data/playground/issue_400/10.wav").read_bytes()).decode("utf-8"),
                },
            },
        ],
    },
]
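
For reference, a minimal end-to-end sketch of sending that payload to a locally running llama-server via its OpenAI-compatible endpoint (host, port, and file path are assumptions):

import base64
import pathlib
import requests

audio_b64 = base64.b64encode(pathlib.Path("input.wav").read_bytes()).decode("utf-8")

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # default llama-server port assumed
    json={
        "messages": [
            {"role": "system", "content": "Perform ASR."},
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_audio",
                        "input_audio": {"format": "wav", "data": audio_b64},
                    },
                ],
            },
        ],
    },
)
print(resp.json()["choices"][0]["message"]["content"])  # the transcription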

@tdakhran tdakhran changed the title from "model : add LFM2-Audio-1.5B support" to "model : add ASR support for LFM2-Audio-1.5B" on Dec 2, 2025
@github-actions github-actions bot added labels on Dec 2, 2025: testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), examples, python (python script changes), ggml (changes relating to the ggml tensor library for machine learning)
@tdakhran
Contributor Author

The code is tested. I'll wait for #17978 to be merged, then rebase and mark the PR as "ready for review".

@tdakhran
Contributor Author

The code is ready for review and is tested with mtmd-cli and llama-server.

python convert_hf_to_gguf.py  /data/playground/checkpoints/LFM2-Audio-1.5B --outtype f32
python convert_hf_to_gguf.py  /data/playground/checkpoints/LFM2-Audio-1.5B --outtype f32 --mmproj

build/bin/llama-mtmd-cli -m /data/playground/checkpoints/LFM2-Audio-1.5B/LFM2-Audio-1.5B-F32.gguf --mmproj /data/playground/checkpoints/LFM2-Audio-1.5B/mmproj-LFM2-Audio-1.5b-F32.gguf -n 30 --audio /data/playground/issue_400/10.wav -sys "Perform ASR." -p "<__media__>" -v

produces valid results for the attached file 10.wav:

encoding audio slice...
audio slice encoded in 39 ms
decoding audio batch 1/1, n_tokens_batch = 33
audio decoded (batch 1/1) in 109 ms

I need more air. Can you increase the fan speed?

Comment on lines 114 to 126
Kcur = ggml_cont(ctx0, ggml_permute(ctx0, Kcur, 0, 2, 1, 3));
Q_bias_u = ggml_cont(ctx0, ggml_permute(ctx0, Q_bias_u, 0, 2, 1, 3));
ggml_tensor * matrix_ac = ggml_mul_mat(ctx0, Q_bias_u, Kcur);
matrix_ac = ggml_cont(ctx0, ggml_permute(ctx0, matrix_ac, 1, 0, 2, 3));
cb(matrix_ac, "conformer.layers.{}.self_attn.id3", il);

auto * p = ggml_mul_mat(ctx0, layer.linear_pos_w, pos_emb);
cb(p, "conformer.layers.{}.self_attn.linear_pos", il);
p = ggml_reshape_3d(ctx0, p, d_head, n_head, p->ne[1]);

Q_bias_v = ggml_cont(ctx0, ggml_permute(ctx0, Q_bias_v, 0, 2, 1, 3));
cb(Q_bias_v, "conformer.layers.{}.self_attn.id0", il);
p = ggml_cont(ctx0, ggml_permute(ctx0, p, 1, 2, 0, 3));
Collaborator

Do you think we could replace this with build_attn?

The advantage of build_attn is that it supports flash attention, which can significantly improve performance, but I'm not sure whether anything is currently missing to make it work in this case.

Contributor Author

I saw some extra pieces like the biases and the matrix_ac / matrix_bd terms, which scared me off, so I followed the Python implementation as-is. I'll give it a second look.
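
For context: if this follows the Transformer-XL-style relative attention used in Conformer encoders (an assumption based on the matrix_ac / matrix_bd naming, not confirmed by the source), the two terms decompose the attention score as

score = (Q + u) K^T + (Q + v) P^T

with matrix_ac = (Q + u) K^T and matrix_bd = (Q + v) P^T, where u and v are learned bias vectors and P is the projected relative positional embedding.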

Contributor Author

Looked into it: build_attn won't fit, there are too many customizations to the attention.

Comment on lines 141 to 148
matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, q_len, pos_len + 1, h);
matrix_bd = ggml_cont(ctx0, ggml_view_3d(ctx0, matrix_bd,
                                         q_len, pos_len, h,
                                         matrix_bd->nb[1], matrix_bd->nb[2], matrix_bd->nb[0] * q_len));
matrix_bd = ggml_reshape_3d(ctx0, matrix_bd, pos_len, q_len, h);
}

matrix_bd = ggml_cont(ctx0, ggml_view_3d(ctx0, matrix_bd,
Collaborator

A bit strange that we have these 4 reshapes/views without any permutations. Can we collapse this into a single ggml_reshape_3d?

Contributor Author

If it were a plain view, the reshapes could be simplified, but there is a crop happening inside the ggml_view_3d.
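
To illustrate the idea, a small numpy sketch of the shift-and-crop this reshape/view sequence appears to implement (shapes are inferred from the quoted code, and ggml lists dimensions fastest-first, so the numpy shapes below are reversed; an illustration, not a verified equivalence):

import numpy as np

h, q_len, pos_len = 2, 3, 5
x = np.arange(h * (pos_len + 1) * q_len, dtype=float).reshape(h, pos_len + 1, q_len)

# Flatten each head, drop the first q_len elements, then refold. Each row
# of the result is realigned relative to its neighbors, which is the
# classic Transformer-XL "rel-shift" trick for relative positions.
y = x.reshape(h, -1)[:, q_len:].reshape(h, q_len, pos_len)

The crop (dropping q_len elements) is what prevents the four reshapes/views from collapsing into a single ggml_reshape_3d.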

Collaborator

hmm yeah interesting. not very important to optimize this, so I'll have a look later to see if there is another way

Comment on lines 209 to 211
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
x = ggml_add(ctx0, ggml_mul(ctx0, x, layer.conv_norm_w), layer.conv_norm_b);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
Collaborator

We may be able to remove the transposes if conv_norm_b is already transposed upon conversion?
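
If we go that route, a hedged sketch of what the conversion-time change could look like in the model's convert_hf_to_gguf.py class, using the existing modify_tensors hook (the tensor-name check is hypothetical, and this only applies to a 2-D tensor):

def modify_tensors(self, data_torch, name, bid):
    # hypothetical: pre-transpose a 2-D weight once at conversion so the
    # compute graph can skip the runtime ggml_transpose pair
    if name.endswith("conv.norm.weight") and data_torch.dim() == 2:
        data_torch = data_torch.transpose(0, 1).contiguous()
    return [(self.map_tensor_name(name), data_torch)]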

Comment on lines 217 to 221
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
auto * conv_pw2_w = ggml_reshape_2d(ctx0, layer.conv_pw2_w, layer.conv_pw2_w->ne[1], layer.conv_pw2_w->ne[2]);
x = ggml_mul_mat(ctx0, conv_pw2_w, x);
x = ggml_add(ctx0, x, layer.conv_pw2_b);
x = ggml_cont(ctx0, ggml_transpose(ctx0, x));
Collaborator

(I'll look into this.) I suspect these two transposes can be removed too (or at worst, one can be a view).

Contributor Author

Many of the transposes here follow the Python code without optimization in mind; the objective was to get numerically close intermediates. I'll take a closer look to understand what can be optimized.

Contributor Author

Removed most of the transposes.

Comment on lines 251 to 258
cur = ggml_mul_mat(ctx0, model.mm_1_w, cur);
cur = ggml_add(ctx0, cur, model.mm_1_b);
cb(cur, "audio_adapter.model.{}", 1);
cur = ggml_gelu_erf(ctx0, cur);
cb(cur, "audio_adapter.model.{}", 2);
cur = ggml_mul_mat(ctx0, model.mm_3_w, cur);
cur = ggml_add(ctx0, cur, model.mm_3_b);
cb(cur, "audio_adapter.model.{}", 3);
Collaborator

this can be replaced with build_ffn
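
For context, the quoted block is a plain two-layer MLP:

y = mm_3_w @ gelu_erf(mm_1_w @ x + mm_1_b) + mm_3_b

which is the pattern build_ffn encapsulates, so the replacement should be mechanical.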

Contributor Author

Didn't recognize the pattern; will replace.

@tdakhran
Contributor Author

@ngxson , I addressed most of the feedback, added a comment explaining why build_attn cannot be used, removed unnecessary transposes, and simplified permutes. Applied the formatting as well.

PR requires #18061, otherwise rope_theta won't be set.

@tdakhran tdakhran force-pushed the tarek/feat/lfm2-asr-upstream branch from 8ba4562 to ba9e597 on December 15, 2025 21:14
@tdakhran
Contributor Author

Rebased to incorporate #18061; it now works as-is.

@ngxson
Collaborator

ngxson commented Dec 15, 2025

Thanks @tdakhran! I'll do a final review tomorrow and will push commits directly here if needed.

For now, my priority will be to make sure that the GGUF is ready for any possible optimizations in the future. We can then look deeper into these optimizations in a follow-up PR (so users won't have to re-generate the GGUF)

@ngxson

This comment was marked as outdated.

@ngxson
Collaborator

ngxson commented Dec 16, 2025

nevermind, I can do a follow-up PR

@ngxson
Collaborator

ngxson commented Dec 16, 2025

Huh? I have no idea why GitHub doesn't allow me to merge it 😂

I will copy the commit to another PR then


@ngxson
Collaborator

ngxson commented Dec 16, 2025

Superseded by #18106

@tdakhran
Contributor Author

@ngxson, my bad, I think I forgot to tick "allow edits" when creating the PR.
