Merged
Signed-off-by: Karen Mosoyan <[email protected]>
Pull request overview
This PR adds Mixture of Experts (MoE) support for the LFM2 architecture, enabling inference of LFM2 models with sparse MoE layers (such as LiquidAI/LFM2-8B-A1B). The implementation follows the existing LFM2Model design while adding MoE-specific routing and expert computation logic.
Changes:
- Added LFM2MoEModel C++ class that extends the base Model class with MoE routing and expert selection
- Implemented moe_expert_apply graph operation for efficient sparse expert computation
- Extended Python converter to handle MoE weight patterns and special loading for lfm2_moe models that may not work with standard HuggingFace loading
- Added MoE-specific configuration parameters (num_experts, num_experts_per_tok, moe_intermediate_dim, etc.)
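The routing described above (sigmoid scores with an optional expert bias for selection, top-k choice, and normalized routing weights with a configurable scaling factor) can be sketched roughly as follows. This is a hypothetical NumPy illustration of the scheme the PR describes, not the actual C++ graph code; the function name and signature are invented for the example.

```python
import numpy as np

def route_tokens(hidden, router_w, expert_bias, top_k,
                 norm_topk_prob=True, routed_scaling_factor=1.0):
    """Pick top-k experts per token (illustrative sketch, not the C++ kernel).

    The bias influences *which* experts are selected, while the mixing
    weights come from the unbiased sigmoid scores."""
    logits = hidden @ router_w.T                 # [tokens, num_experts]
    scores = 1.0 / (1.0 + np.exp(-logits))       # sigmoid routing scores
    selection = scores + expert_bias if expert_bias is not None else scores
    topk_idx = np.argsort(-selection, axis=-1)[:, :top_k]
    topk_w = np.take_along_axis(scores, topk_idx, axis=-1)
    if norm_topk_prob:
        topk_w = topk_w / np.clip(topk_w.sum(-1, keepdims=True), 1e-9, None)
    return topk_idx, topk_w * routed_scaling_factor
```

The `norm_topk_prob` and `routed_scaling_factor` knobs mirror the configuration parameters listed above.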
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| python/src/weight_patterns.py | Adds MoE weight patterns with {channel} placeholders for expert indices (router, expert_bias, and per-expert w1/w3/w2 weights) |
| python/src/converter.py | Implements channel-based pattern expansion for MoE experts; adds special handling for lfm2_moe to bypass HuggingFace model loading and work directly with raw state_dict |
| python/src/config_utils.py | Extracts MoE-specific configuration (num_experts_per_tok, norm_topk_prob, use_expert_bias, routed_scaling_factor) using consistent cfg_get pattern |
| python/src/cli.py | Adds _load_raw_hf_state_dict helper and _RawModelWrapper class to support lfm2_moe models that may have tokenizer or loading issues |
| cactus/models/model_lfm2moe.cpp | Implements LFM2MoEModel with build_mlp supporting both dense and MoE layers; includes routing logic with sigmoid + bias, topk selection, and per-expert gated SwiGLU |
| cactus/models/model.h | Declares LFM2MoEModel class with ExpertWeights structure and LayerWeights containing moe_router, moe_expert_bias, and vector of expert weights |
| cactus/graph/graph_ops_nn.cpp | Implements compute_moe_expert_apply_node with sparse token selection, compact matmul, and normalized routing weights with configurable scaling |
| cactus/graph/graph_execute.cpp | Adds MOE_EXPERT_APPLY to execution dispatch and updates profiling column widths to accommodate longer operation names |
| cactus/graph/graph_builder.cpp | Adds moe_expert_apply builder with comprehensive shape validation for accumulator, hidden state, routing probs, topk indices, and expert weights |
| cactus/graph/graph.h | Adds MOE_EXPERT_APPLY OpType and normalize_routing flag to OpParams; declares moe_expert_apply graph builder function |
| cactus/engine/engine_model.cpp | Adds MoE config parsing and factory logic to instantiate LFM2MoEModel when num_experts/moe_intermediate_dim/num_experts_per_tok are non-zero |
| cactus/engine/engine.h | Adds MoE configuration fields to Config struct with appropriate defaults |
| README.md | Adds LiquidAI/LFM2-8B-A1B to supported models list |
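The `moe_expert_apply` operation described in the table (sparse token selection, compact matmul, scatter-add of weighted expert outputs) can be sketched in NumPy as below. This is an illustrative model of the computation only, under the assumption of per-expert gated SwiGLU with w1/w3/w2 weights as described above; the function and argument names are hypothetical, not the actual graph-op API.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def moe_expert_apply(acc, hidden, topk_idx, topk_w, experts):
    """Sparse expert application (sketch): for each expert, gather only the
    tokens routed to it, run a compact gated-SwiGLU MLP on that small batch,
    then scatter-add the routing-weighted result into the accumulator."""
    for e, (w1, w3, w2) in enumerate(experts):
        rows, slots = np.nonzero(topk_idx == e)   # tokens selecting expert e
        if rows.size == 0:
            continue
        x = hidden[rows]                          # compact [n_e, dim] batch
        y = (silu(x @ w1.T) * (x @ w3.T)) @ w2.T  # gated SwiGLU
        np.add.at(acc, rows, topk_w[rows, slots, None] * y)
    return acc
```

The key point is that each expert's matmul runs only over the tokens that selected it, rather than over the full batch, which is what makes the MoE layer sparse.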
Collaborator
super excited about this one
Signed-off-by: HenryNdubuaku <[email protected]>
Mandark-droid added a commit to Mandark-droid/cactus that referenced this pull request on Feb 23, 2026:
Addresses review feedback on PR cactus-compute#383: refactors the Qwen3MoeForCausalLM implementation to use the generalized moe_layer graph operation introduced in PR cactus-compute#374, instead of a custom per-expert loop.

Key changes:
- build_mlp uses gb->moe_layer() matching the LFM2MoEModel pattern
- WeightNodeIDs uses ExpertWeights struct (w1/w3/w2) consistent with upstream
- Weight file naming follows upstream convention (moe_expert_ prefix)
- MoE detection via config fields (num_experts > 0) like LFM2, no new enum
- Attention uses INT8 KV cache with attention_int8_hybrid (standard path)
- QK normalization per-head before RoPE (Qwen3 architecture requirement)

Bug fixes included:
- Fix greedy sampling (temperature=0) to use pure argmax regardless of top_p/top_k
- Move token_history to file scope with clear_sample_history() to prevent cross-model sampling contamination
- Add FP32 softmax support for MoE router weight precision
- Fix config parsing to strip \r\n from values

Python conversion:
- Detect Qwen3MoeForCausalLM model type
- Per-expert SwiGLU weight extraction (fused and individual tensor formats)
- FP16 auto-promotion for MoE router weights

Signed-off-by: Mandark-droid <[email protected]>
https://claude.ai/code/session_01SFkVPWXCCtTTpmMwsj274A
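The greedy-sampling fix mentioned in the commit message (temperature=0 short-circuits to pure argmax, ignoring top_p/top_k) could look roughly like the sketch below. This is a hypothetical Python illustration of the fix's behavior, not the engine's actual C++ sampler; all names here are invented for the example.

```python
import numpy as np

def sample_token(logits, temperature, top_k=0, top_p=1.0, rng=None):
    """Sketch of the fixed sampling path: temperature == 0 means
    deterministic greedy decoding, so top_k/top_p must not apply."""
    if temperature == 0.0:
        return int(np.argmax(logits))            # pure argmax, no filtering
    rng = rng or np.random.default_rng()
    logits = logits / temperature
    if top_k > 0:                                # keep only the k largest
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p < 1.0:                              # nucleus filtering
        order = np.argsort(-probs)
        csum = np.cumsum(probs[order])
        keep = order[csum - probs[order] < top_p]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    return int(rng.choice(len(probs), p=probs))
```

Before the fix, aggressive top_p/top_k settings could interfere even at temperature 0; the early return makes the greedy path unconditional.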
Mandark-droid added a commit to Mandark-droid/cactus that referenced this pull request on Feb 23, 2026:
ncylich pushed a commit that referenced this pull request on Feb 24, 2026:
* added moe support for lfm
* optimisations
* cleanup
* generalise moe op
---------
Signed-off-by: Karen Mosoyan <[email protected]>
Signed-off-by: HenryNdubuaku <[email protected]>
Co-authored-by: HenryNdubuaku <[email protected]>
cattermelon1234 pushed a commit to cattermelon1234/cactus that referenced this pull request on Feb 28, 2026: