Add Qwen3MoeForCausalLM support #383
Mandark-droid wants to merge 2 commits into cactus-compute:main
Conversation
Force-pushed dd57ef9 to 509cca1
Hey @HenryNdubuaku @rshemet — this PR adds native Qwen3 MoE (Qwen3MoeForCausalLM) support to Cactus. It's been rebased on top of #374 (Karen's LFM2 MoE) so both can land cleanly. Tested end-to-end on Android ARM64 with the Loggenix 0.6B model — prefill and decode produce coherent output. Would appreciate a review when you get a chance!
Force-pushed de56c4a to 5a747ea
Thanks so much for this @Mandark-droid. One thing: you often wanna wait for pending PRs to be merged, else you might build on faulty code. For instance, I refactored that PR to have a generalised moe_layer and merged it, so you now have to work with that rather than recreating everything :(
Addresses review feedback on PR cactus-compute#383: refactors the Qwen3MoeForCausalLM implementation to use the generalized moe_layer graph operation introduced in PR cactus-compute#374, instead of a custom per-expert loop.

Key changes:
- build_mlp uses gb->moe_layer() matching the LFM2MoEModel pattern
- WeightNodeIDs uses ExpertWeights struct (w1/w3/w2) consistent with upstream
- Weight file naming follows upstream convention (moe_expert_ prefix)
- MoE detection via config fields (num_experts > 0) like LFM2, no new enum
- Attention uses INT8 KV cache with attention_int8_hybrid (standard path)
- QK normalization per-head before RoPE (Qwen3 architecture requirement)

Bug fixes included:
- Fix greedy sampling (temperature=0) to use pure argmax regardless of top_p/top_k
- Move token_history to file scope with clear_sample_history() to prevent cross-model sampling contamination
- Add FP32 softmax support for MoE router weight precision
- Fix config parsing to strip \r\n from values

Python conversion:
- Detect Qwen3MoeForCausalLM model type
- Per-expert SwiGLU weight extraction (fused and individual tensor formats)
- FP16 auto-promotion for MoE router weights

Signed-off-by: Mandark-droid <[email protected]>
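The greedy-sampling fix mentioned above can be illustrated with a minimal sketch. This is not the actual Cactus code — the function names (`sample_greedy`, `sample_token`) and signatures are hypothetical — it only shows the behavior the commit describes: when `temperature == 0`, take a pure argmax over the raw logits and ignore `top_p`/`top_k` entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pure argmax over raw logits — the greedy path.
size_t sample_greedy(const std::vector<float>& logits) {
    size_t best = 0;
    for (size_t i = 1; i < logits.size(); ++i) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

// Hypothetical sampler entry point. The fix: temperature == 0 short-circuits
// to argmax before any top_p/top_k filtering can perturb the result.
size_t sample_token(const std::vector<float>& logits,
                    float temperature, float top_p, int top_k) {
    if (temperature == 0.0f) {
        return sample_greedy(logits);
    }
    // Stochastic path (temperature-scaled softmax, top-k/top-p) elided.
    (void)top_p; (void)top_k;
    return sample_greedy(logits); // placeholder for the sketch
}
```

With this ordering, aggressive `top_p`/`top_k` settings left over from a previous model's config can no longer change the token chosen at `temperature=0`.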
Force-pushed 5a747ea to 32eded7
Thanks for the feedback @HenryNdubuaku — good lesson learned on building on pending PRs. I've rebased on main and refactored to use the generalised moe_layer. Validated on Android ARM64 (Pixel 7a, Tensor G2) with the Loggenix 0.6B model — all tests passing. Model init: ~352ms. Ready for another look when you get a chance!
Hi @HenryNdubuaku, thanks again for the feedback on PR #383. I wanted to follow up since it was closed rather than merged — was there a specific reason for that? I'm wondering if the Qwen3 MoE changes and Loggenix model support ended up in main through a different route, or if there's something I should adjust to get it across. Let me know what makes sense for next steps!
Add Qwen3MoeForCausalLM support
Adds native Qwen3 MoE (Mixture of Experts) support to the Cactus v1.x graph engine. Uses the generalized `moe_layer()` graph operation from #374, combining QwenModel's decoder (GQA, QK norm, RoPE, INT8 KV cache) with SwiGLU (3-weight) expert FFN routing.

**What**

C++ model (`model_qwen_moe.cpp`, `model.h`)
- `Qwen3MoeModel` class using `gb->moe_layer()` — matches the LFM2MoE pattern
- `ExpertWeights` struct (w1/w3/w2) consistent with upstream convention
- `attention_int8_hybrid` (standard path)
- MoE detection via config fields (`num_experts > 0`) — no new enum needed

Engine integration (`engine_model.cpp`)
- `model_type=qwen3_moe` parsed as `ModelType::QWEN`; factory branches on `num_experts > 0` to create `Qwen3MoeModel`
- New config fields: `moe_intermediate_size`, `norm_topk_prob`, `num_experts_per_tok`

Python conversion (`config_utils.py`, `converter.py`, `weight_patterns.py`, `tensor_io.py`)
- Detects the `Qwen3MoeForCausalLM` / `qwen3_moe` model type
- Extracts `num_experts_per_tok`, `moe_intermediate_size`, `norm_topk_prob`

Sampling fixes (`kernel_nn.cpp`, `kernel.h`, `graph_core.cpp`, `graph.h`)
- Move `token_history` to file scope with `clear_sample_history()` to prevent cross-model contamination
- Fix greedy sampling (`temperature=0`) to use pure argmax

**Architecture support**
Parameterized for the full Qwen3 MoE family:
**Benchmarks**
Loggenix-MoE-0.62B (16 experts, top-2, 512 hidden, 12 layers) on Android ARM64 (Pixel 7a, Tensor G2):
Model init: ~352ms. All 5 tests passing.
**Zero new graph ops**

All operations already exist: TOPK, SOFTMAX, MATMUL, SILU, MULTIPLY, RMSNORM, ROPE, INDEX, plus the generalized `moe_layer` from #374.
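To make the composition concrete, the ops listed above (router MATMUL → SOFTMAX → TOPK, then per-expert SwiGLU, then a weighted sum) can be sketched in plain C++. This is an illustrative reference model of top-k MoE routing with renormalized gate weights (the `norm_topk_prob` behavior), not the Cactus graph API — all names here are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // rows x cols

static Vec matvec(const Mat& W, const Vec& x) {
    Vec y(W.size(), 0.0f);
    for (size_t r = 0; r < W.size(); ++r)
        y[r] = std::inner_product(W[r].begin(), W[r].end(), x.begin(), 0.0f);
    return y;
}

static float silu(float v) { return v / (1.0f + std::exp(-v)); }

// SwiGLU expert: w2 * (silu(w1 x) * (w3 x)) — the 3-weight FFN.
struct Expert { Mat w1, w3, w2; };

Vec swiglu(const Expert& e, const Vec& x) {
    Vec g = matvec(e.w1, x), u = matvec(e.w3, x);
    for (size_t i = 0; i < g.size(); ++i) g[i] = silu(g[i]) * u[i];
    return matvec(e.w2, g);
}

Vec moe_forward(const Mat& router, const std::vector<Expert>& experts,
                const Vec& x, size_t top_k) {
    // Router: MATMUL then SOFTMAX over expert logits.
    Vec logits = matvec(router, x);
    float m = *std::max_element(logits.begin(), logits.end()), s = 0.0f;
    Vec p(logits.size());
    for (size_t i = 0; i < p.size(); ++i) { p[i] = std::exp(logits[i] - m); s += p[i]; }
    for (float& v : p) v /= s;
    // TOPK: pick the top_k highest-probability experts.
    std::vector<size_t> idx(p.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
                      [&](size_t a, size_t b) { return p[a] > p[b]; });
    // Renormalize the selected gate weights to sum to 1 (norm_topk_prob).
    float norm = 0.0f;
    for (size_t k = 0; k < top_k; ++k) norm += p[idx[k]];
    // Weighted sum of selected expert outputs (SILU/MULTIPLY inside swiglu).
    Vec out(x.size(), 0.0f);
    for (size_t k = 0; k < top_k; ++k) {
        Vec y = swiglu(experts[idx[k]], x);
        for (size_t i = 0; i < out.size(); ++i) out[i] += (p[idx[k]] / norm) * y[i];
    }
    return out;
}
```

The generalized `moe_layer` builds this same dataflow as graph nodes rather than loops, which is why the PR needs no new kernel: every stage maps onto an existing op.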