support GLM-4.5V and GLM-4.1V vision models #16600
Conversation
need `clip.vision.rope.freq_base` for GLM-4.5V
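For context, here is a minimal sketch of how such a key could be read with the plain gguf API, assuming the key name above and a fallback default when it is absent; the actual clip/mtmd loader uses its own key helpers, so this is illustrative only:

```cpp
#include "gguf.h" // ggml's GGUF reader: gguf_find_key / gguf_get_val_f32

// Illustrative helper (not the actual mtmd code): read the vision RoPE
// frequency base from an mmproj GGUF, with a fallback for older files.
static float read_vision_rope_freq_base(const struct gguf_context * gguf) {
    const int64_t kid = gguf_find_key(gguf, "clip.vision.rope.freq_base");
    if (kid < 0) {
        return 10000.0f; // assumed default when the key is missing
    }
    return gguf_get_val_f32(gguf, kid);
}
```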
So, it turns out that vision in this model is based on Qwen3-VL, which still needs support from llama.cpp. I am pretty familiar with llama.cpp in general but not with `mtmd`. Also, I just saw this thread (#16207) in which someone has posted a patch to get Qwen3-VL kinda-sorta-working in llama.cpp. I will take a look at that too and see if it is helpful - it might make more sense to get Qwen3-VL to a working state in llama.cpp first and only then start working on this PR on top of that. Not sure, just thinking out loud.
Thanks for your work! @ddh0
Thank you @rujialiu! I suspect your understanding of the model is better than mine. Also cc @ngxson (llama vision expert :))
I have 0 understanding of `mtmd`.
@ddh0 I asked Claude Sonnet 4.5 to carefully inspect the modeling code.
It's so similar to Qwen2.5-VL, but why does the code re-use qwen3_vl_moe? It's because Qwen2.5-VL doesn't have an moe version 😄 So I guess it's ok to resume the work directly, based on https://github.com/FMayran/llama.cpp/tree/QwenVL-causal-fix. It should be easy to adapt to whatever "llama_batch improvement" is merged into llama.cpp. BTW: can we make sure the dense version (GLM-4.1V-9B-Thinking, #14495) is working first? It's much smaller and easier to compare results with the reference implementation.
Thank you @rujialiu, that's all very helpful. I will take a look at GLM-4.1V-9B-Thinking and see if it can be incorporated into this PR. Is there a PR associated with the branch you linked (QwenVL-causal-fix)?
Of course! Hopefully @ngxson will find some time to fix the general problem (adding an internal token index for the causal check). Since you're familiar with the LLM part, you can take a look at our discussion in #15474 (the quickest way is to read it in bottom-up order until you understand). The issue and solution are conceptually very simple, but I'm not brave/skillful enough to touch that code myself.
now there is: #16745
still need to figure out what exactly needs to be changed...
This is essentially the same thing that LFM2 uses; you can copy most of the code from that model (already supported by mtmd). The key difference is in the projection stage that GLM4V uses.
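For a rough idea of that projection stage, here is a hedged ggml-style sketch based on my reading of the HF `Glm4vVisionPatchMerger` (a linear proj, LayerNorm and GELU, followed by a SwiGLU-style gate/up/down MLP); the function and tensor names are illustrative, not the actual mtmd names, and the real graph may differ:

```cpp
#include "ggml.h"

// Hedged sketch of a GLM4V-style patch-merger projection (illustrative names,
// not the actual mtmd implementation).
static ggml_tensor * build_glm4v_merger_sketch(
        ggml_context * ctx,
        ggml_tensor  * cur,     // [n_embd, n_tokens] merged patch embeddings
        ggml_tensor  * w_proj,  // [n_embd, n_embd]
        ggml_tensor  * norm_w,  // [n_embd] LayerNorm weight
        ggml_tensor  * norm_b,  // [n_embd] LayerNorm bias
        ggml_tensor  * w_gate,  // [n_embd, n_ff]
        ggml_tensor  * w_up,    // [n_embd, n_ff]
        ggml_tensor  * w_down,  // [n_ff, n_embd]
        float          eps) {
    cur = ggml_mul_mat(ctx, w_proj, cur);                       // proj
    cur = ggml_norm(ctx, cur, eps);                             // post_projection_norm
    cur = ggml_add(ctx, ggml_mul(ctx, cur, norm_w), norm_b);
    cur = ggml_gelu(ctx, cur);                                  // first activation
    ggml_tensor * gate = ggml_silu(ctx, ggml_mul_mat(ctx, w_gate, cur)); // act_fn(gate_proj(x))
    ggml_tensor * up   = ggml_mul_mat(ctx, w_up, cur);                   // up_proj(x)
    cur = ggml_mul(ctx, gate, up);
    return ggml_mul_mat(ctx, w_down, cur);                      // down_proj
}
```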
llm_build_glm4v::llm_build_glm4v(const llama_model & model, const llm_graph_params & params) : llm_graph_context(params) {
    //
    // TODO -- currently this is just copied from `llm_build_glm4` -- still WIP
Normally, the text model of a "vision" model is just a normal text model, so you probably don't need to add a new arch for it (no need to change anything in the main llama.cpp code). The only thing that needs to change is mtmd.
Yeah, I am still not 100% sure if we need separate architectures for these vision models or not. The paper mentions:
> To further enhance spatial awareness on the language side, we extend RoPE to 3D-RoPE in the LLM.
I think what they're referring to as "3D-RoPE in the LLM" is actually M-RoPE, which glm4 and glm4_moe do not use.
Maybe M-RoPE could be conditionally incorporated into the existing llm_build_glm4 and llm_build_glm4_moe graphs, but I thought it would be cleaner for the implementation of the vision models to be separate. I also did it this way following the pattern of Qwen3 / Qwen3VL being separate, as I think GLM is not too dissimilar from those.
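For what it's worth, here is a hedged sketch of what "conditionally incorporated" could look like using ggml's existing M-RoPE operator; this is not the actual llm_build_glm4 code, and the flag, section values, and parameter plumbing are assumptions:

```cpp
#include "ggml.h"

// Hedged sketch: pick between standard NeoX RoPE and M-RoPE in a build function.
// `sections` (the per-axis rotary dim split) would come from GGUF metadata; the
// names and default values here are placeholders.
static ggml_tensor * rope_maybe_mrope(
        ggml_context * ctx,
        ggml_tensor  * cur,         // [n_rot, n_head, n_tokens] Q or K
        ggml_tensor  * pos,         // position ids (4 streams per token for M-RoPE)
        int            n_rot,
        int            n_ctx_orig,
        float          freq_base,
        float          freq_scale,
        bool           use_mrope,
        int            sections[4]) {
    if (use_mrope) {
        return ggml_rope_multi(ctx, cur, pos, nullptr, n_rot, sections,
                GGML_ROPE_TYPE_MROPE, n_ctx_orig, freq_base, freq_scale,
                /*ext_factor*/ 0.0f, /*attn_factor*/ 1.0f,
                /*beta_fast*/ 32.0f, /*beta_slow*/ 1.0f);
    }
    return ggml_rope_ext(ctx, cur, pos, nullptr, n_rot,
            GGML_ROPE_TYPE_NEOX, n_ctx_orig, freq_base, freq_scale,
            0.0f, 1.0f, 32.0f, 1.0f);
}
```

Whether something like this lives inside llm_build_glm4 behind a flag or in a separate llm_build_glm4v is mostly a readability call, as discussed above.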
also renamed `glm4v_moe.cpp` to `glm4v-moe.cpp` to match other model files
Can you also add this to Ollama cloud with vision / multimodal support?
GLM support has landed and is open for review: #17967 - Enjoy!
@ddh0 are you still actively working on this PR? I'll have a look in the upcoming days.
Nevermind, this PR doesn't have the tensor_mapping for mmproj that I need, so it's probably better for me to start from zero.
No, I sort of got stuck and wasn't sure how to proceed, and I also got a job so I have less free time now. As I'm sure you know, there have been some MRoPE fixes/additions since I first started this PR, so there is probably something I'm missing.
Sure, I would appreciate it if you took over, you know what you're doing more than I do.
@ddh0, thanks for the great starting point, your work was super helpful! I just tried to contribute a working path to unblock the community while official, maintainer-approved support for glm4.6v is in consideration/progress. Please feel free to use #17998 in any manner (the take-or-leave code works), if helpful. Apologies for the noise/thrash!
Closing as superseded by #18042 - thanks ngxson!!

Add support for zai-org/GLM-4.5V and zai-org/GLM-4.1V-9B-Thinking vision models to llama.cpp. I currently only plan to support images + text, no video inputs in this PR.
The architecture is `Glm4vMoeForConditionalGeneration` (`"model_type": "glm4v_moe"`) / `Glm4vForConditionalGeneration` (`"model_type": "glm4v"`). Internally, these consist of an LLM (text model) and a ViT (vision adapter / multimodal projector):

LLM
- `model.language_model`
- `apply_multimodal_rotary_pos_emb`: applies rotary embeddings across temporal, height, and width dimensions for visual tokens (see the sketch below)

ViT
- `Aimv2VisionModel` (`model.visual`)
- `Glm4vMoeVisionEmbeddings` module to handle varied image resolutions
- vision rotary embeddings (`apply_rotary_pos_emb_vision`)
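To make the temporal/height/width part concrete, here is a small hedged sketch of how M-RoPE-style position triples could be laid out for one image's tokens; it roughly mirrors the M-RoPE scheme already used for Qwen2-VL in llama.cpp, and the exact GLM-4.5V rules (spatial merging, temporal index for stills) are assumptions:

```cpp
#include <cstdint>
#include <vector>

// Hedged sketch: (temporal, height, width) position ids for the tokens of one
// image that starts at text position `start_pos`. Details of the real GLM-4.5V
// assignment may differ; this only illustrates the three position streams.
struct mrope_pos {
    std::vector<int32_t> t, h, w;
};

static mrope_pos build_image_mrope_positions(int32_t start_pos, int grid_h, int grid_w) {
    mrope_pos out;
    for (int y = 0; y < grid_h; ++y) {
        for (int x = 0; x < grid_w; ++x) {
            out.t.push_back(start_pos);     // a single image occupies one temporal step
            out.h.push_back(start_pos + y); // row index, offset by the text position
            out.w.push_back(start_pos + x); // column index, offset by the text position
        }
    }
    return out;
}
```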