
added moe support for lfm #374

Merged

HenryNdubuaku merged 4 commits into main from karen/lfm2moe on Feb 23, 2026

Conversation

@kar-m
Collaborator

@kar-m kar-m commented Feb 20, 2026

No description provided.

Signed-off-by: Karen Mosoyan <[email protected]>
@kar-m kar-m marked this pull request as ready for review February 20, 2026 07:58
Copilot AI review requested due to automatic review settings February 20, 2026 07:58
Contributor

Copilot AI left a comment


Pull request overview

This PR adds Mixture of Experts (MoE) support for the LFM2 architecture, enabling inference of LFM2 models with sparse MoE layers (such as LiquidAI/LFM2-8B-A1B). The implementation follows the existing LFM2Model design while adding MoE-specific routing and expert-computation logic.

Changes:

  • Added LFM2MoEModel C++ class that extends the base Model class with MoE routing and expert selection
  • Implemented moe_expert_apply graph operation for efficient sparse expert computation
  • Extended Python converter to handle MoE weight patterns and special loading for lfm2_moe models that may not work with standard HuggingFace loading
  • Added MoE-specific configuration parameters (num_experts, num_experts_per_tok, moe_intermediate_dim, etc.)
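The routing scheme described in this PR (sigmoid router plus a selection bias, top-k expert choice, optionally normalized routing weights with a configurable scaling factor, and per-expert gated SwiGLU) can be sketched as a dense NumPy reference. This is an illustrative model of the math only, not the cactus C++ kernel; the function name `moe_forward` and the representation of experts as `(w1, w3, w2)` tuples are assumptions, while the parameter names `norm_topk_prob` and `routed_scaling_factor` come from the config fields listed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(x):
    return x * sigmoid(x)

def moe_forward(h, router_w, expert_bias, experts, top_k,
                norm_topk_prob=True, routed_scaling_factor=1.0):
    """Dense reference sketch of a sparse MoE layer.

    h:           [tokens, hidden] activations
    router_w:    [num_experts, hidden] router weights
    expert_bias: [num_experts] bias used only for expert *selection*
    experts:     list of (w1, w3, w2) weight tuples per expert
    """
    logits = h @ router_w.T                       # [tokens, num_experts]
    scores = sigmoid(logits)                      # routing scores
    selection = scores + expert_bias              # bias affects selection only
    topk_idx = np.argsort(-selection, axis=-1)[:, :top_k]

    out = np.zeros_like(h)
    for t in range(h.shape[0]):
        w = scores[t, topk_idx[t]]                # unbiased scores as weights
        if norm_topk_prob:
            w = w / (w.sum() + 1e-20)             # normalize over chosen experts
        w = w * routed_scaling_factor
        for k, e in enumerate(topk_idx[t]):
            w1, w3, w2 = experts[e]
            gated = silu(h[t] @ w1.T) * (h[t] @ w3.T)   # gated SwiGLU
            out[t] += w[k] * (gated @ w2.T)
    return out
```

Per the file summary below, the actual `compute_moe_expert_apply_node` gathers only the tokens routed to each expert and runs compact matmuls; this sketch loops per token for clarity instead.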

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated no comments.

Summary per file:

  • python/src/weight_patterns.py: Adds MoE weight patterns with {channel} placeholders for expert indices (router, expert_bias, and per-expert w1/w3/w2 weights)
  • python/src/converter.py: Implements channel-based pattern expansion for MoE experts; adds special handling for lfm2_moe to bypass HuggingFace model loading and work directly with the raw state_dict
  • python/src/config_utils.py: Extracts MoE-specific configuration (num_experts_per_tok, norm_topk_prob, use_expert_bias, routed_scaling_factor) using the consistent cfg_get pattern
  • python/src/cli.py: Adds a _load_raw_hf_state_dict helper and a _RawModelWrapper class to support lfm2_moe models that may have tokenizer or loading issues
  • cactus/models/model_lfm2moe.cpp: Implements LFM2MoEModel with build_mlp supporting both dense and MoE layers; includes routing logic with sigmoid + bias, top-k selection, and per-expert gated SwiGLU
  • cactus/models/model.h: Declares the LFM2MoEModel class with an ExpertWeights structure and LayerWeights containing moe_router, moe_expert_bias, and a vector of expert weights
  • cactus/graph/graph_ops_nn.cpp: Implements compute_moe_expert_apply_node with sparse token selection, compact matmul, and normalized routing weights with configurable scaling
  • cactus/graph/graph_execute.cpp: Adds MOE_EXPERT_APPLY to the execution dispatch and widens profiling column widths to accommodate longer operation names
  • cactus/graph/graph_builder.cpp: Adds the moe_expert_apply builder with comprehensive shape validation for the accumulator, hidden state, routing probs, top-k indices, and expert weights
  • cactus/graph/graph.h: Adds the MOE_EXPERT_APPLY OpType and a normalize_routing flag to OpParams; declares the moe_expert_apply graph builder function
  • cactus/engine/engine_model.cpp: Adds MoE config parsing and factory logic to instantiate LFM2MoEModel when num_experts/moe_intermediate_dim/num_experts_per_tok are non-zero
  • cactus/engine/engine.h: Adds MoE configuration fields to the Config struct with appropriate defaults
  • README.md: Adds LiquidAI/LFM2-8B-A1B to the supported models list
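The {channel}-placeholder expansion mentioned for python/src/weight_patterns.py and python/src/converter.py can be illustrated with a minimal sketch. The converter's real logic may differ, and the example pattern string below is hypothetical; only the `{channel}` placeholder convention comes from this PR.

```python
def expand_channel_patterns(pattern, num_experts):
    """Expand a weight-name pattern containing a '{channel}' placeholder
    into one concrete name per expert index; patterns without the
    placeholder pass through unchanged."""
    if "{channel}" not in pattern:
        return [pattern]
    return [pattern.format(channel=i) for i in range(num_experts)]
```

For example, a pattern like `"layers.0.experts.{channel}.w1.weight"` with 4 experts would expand to four concrete tensor names, one per expert.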


@rshemet
Collaborator

rshemet commented Feb 20, 2026

super excited about this one

Signed-off-by: HenryNdubuaku <[email protected]>
Signed-off-by: HenryNdubuaku <[email protected]>
Signed-off-by: HenryNdubuaku <[email protected]>
@HenryNdubuaku HenryNdubuaku merged commit 93f49c2 into main Feb 23, 2026
1 of 2 checks passed
Mandark-droid added a commit to Mandark-droid/cactus that referenced this pull request Feb 23, 2026
Addresses review feedback on PR cactus-compute#383: refactors the Qwen3MoeForCausalLM
implementation to use the generalized moe_layer graph operation introduced
in PR cactus-compute#374, instead of a custom per-expert loop.

Key changes:
- build_mlp uses gb->moe_layer() matching the LFM2MoEModel pattern
- WeightNodeIDs uses ExpertWeights struct (w1/w3/w2) consistent with upstream
- Weight file naming follows upstream convention (moe_expert_ prefix)
- MoE detection via config fields (num_experts > 0) like LFM2, no new enum
- Attention uses INT8 KV cache with attention_int8_hybrid (standard path)
- QK normalization per-head before RoPE (Qwen3 architecture requirement)

Bug fixes included:
- Fix greedy sampling (temperature=0) to use pure argmax regardless of top_p/top_k
- Move token_history to file scope with clear_sample_history() to prevent
  cross-model sampling contamination
- Add FP32 softmax support for MoE router weight precision
- Fix config parsing to strip \r\n from values

Python conversion:
- Detect Qwen3MoeForCausalLM model type
- Per-expert SwiGLU weight extraction (fused and individual tensor formats)
- FP16 auto-promotion for MoE router weights

Signed-off-by: Mandark-droid <[email protected]>

https://claude.ai/code/session_01SFkVPWXCCtTTpmMwsj274A
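The greedy-sampling fix described in the commit above (temperature = 0 short-circuits to pure argmax before any top-k/top-p filtering is applied) can be sketched as follows. This is a hypothetical standalone illustration, not the cactus sampler; the function name and filtering details are assumptions.

```python
import numpy as np

def sample_token(logits, temperature, top_k=0, top_p=1.0, rng=None):
    """Sample a token id from logits; temperature == 0 means greedy."""
    if temperature == 0.0:
        # The fix: greedy decoding ignores top_k/top_p entirely.
        return int(np.argmax(logits))

    probs = np.exp((logits - logits.max()) / temperature)
    probs = probs / probs.sum()

    if top_k > 0:
        cutoff = np.sort(probs)[-top_k]           # k-th largest probability
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p < 1.0:
        order = np.argsort(-probs)                # descending by probability
        csum = np.cumsum(probs[order])
        drop = order[csum > top_p][1:]            # keep the token that crosses top_p
        probs[drop] = 0.0

    probs = probs / probs.sum()
    rng = rng if rng is not None else np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))
```

Before such a fix, aggressive top_p/top_k settings could filter the argmax token out even at temperature 0, making "greedy" decoding non-greedy.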
ncylich pushed a commit that referenced this pull request Feb 24, 2026
* added moe support for lfm

Signed-off-by: Karen Mosoyan <[email protected]>

* optimisations

Signed-off-by: HenryNdubuaku <[email protected]>

* cleanup

Signed-off-by: HenryNdubuaku <[email protected]>

* generalise moe op

Signed-off-by: HenryNdubuaku <[email protected]>

---------

Signed-off-by: Karen Mosoyan <[email protected]>
Signed-off-by: HenryNdubuaku <[email protected]>
Co-authored-by: HenryNdubuaku <[email protected]>
cattermelon1234 pushed a commit to cattermelon1234/cactus that referenced this pull request Feb 28, 2026