TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration

TBD

We identify a fundamental mismatch between MoE architectures and dLLMs: a large number of experts is activated at each denoising step, while only a small subset of tokens is ultimately accepted, resulting in substantial inference overhead and limiting deployment in latency-sensitive applications.

We propose TEAM, a plug-and-play framework that accelerates MoE dLLMs by accepting more tokens with fewer activated experts. TEAM employs three complementary expert-activation and decoding strategies: it conservatively selects the necessary experts for decoded and masked tokens, while simultaneously performing aggressive speculative exploration across multiple candidates.
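To make the idea concrete, here is a deliberately simplified sketch (not the actual TEAM implementation; all names and thresholds are hypothetical) of temporal-consistency-guided expert selection: prefer experts that are both high-scoring at the current denoising step and already active at the previous step, falling back to the plain router top-k when the overlap is too small.

```python
# Illustrative sketch only -- not TEAM's real algorithm. `top_k_experts`,
# `select_experts`, and the thresholds are hypothetical names for exposition.

def top_k_experts(router_scores, k):
    """Indices of the k highest-scoring experts."""
    return set(sorted(range(len(router_scores)),
                      key=lambda i: router_scores[i], reverse=True)[:k])

def select_experts(router_scores, prev_active, top_k=8, keep_k=4):
    """Keep experts that are high-scoring now AND were active at the
    previous denoising step; fall back to plain top-k if too few overlap."""
    candidates = top_k_experts(router_scores, top_k)
    consistent = candidates & prev_active
    if len(consistent) >= keep_k:
        return consistent  # fewer experts activated, exploiting consistency
    return candidates
```

When consecutive denoising steps route similarly (the temporal-consistency assumption), the intersection is large and fewer experts need to be activated.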

Overall Performance:

With the SDAR-30B-A3B model, TEAM achieves an average speedup of 1.94× across diverse benchmarks, with a peak speedup of 2.2× on HumanEval.

TBD

Installation

1. Clone the repository:

git clone https://github.com/PKU-SEC-Lab/TEAM-MoE-dLLM.git
cd TEAM-MoE-dLLM

2. Create and activate a conda environment:

conda create --name <your_env_name> python=3.10
conda activate <your_env_name>

3. Install the dependencies:

Follow the Environment Setup in SDAR, or install them with:

conda env create -f evaluation/environment.yml

which mirrors SDAR's configuration.

Usage

Download the SDAR-30B-A3B model from Hugging Face.

You can either run inference directly, or additionally log expert-activation and decoding-order information during inference.

1. Direct inference:

Replace modeling_sdar_moe.py in the downloaded model directory with the modeling_sdar_moe.py provided in this repository.

cd evaluation/opencompass
CUDA_VISIBLE_DEVICES=<GPU_ID> python run.py configs/eval_sdar_hf_<Task_Name>.py

Parameter descriptions:

  • <GPU_ID>: Choose which GPU to run on
  • <Task_Name>: Select the benchmark for evaluation
    • Options: gsm8k, math, humaneval, mbpp

Example:

CUDA_VISIBLE_DEVICES=0 python run.py configs/eval_sdar_hf_gsm8k.py

Before inference, make sure to replace the model path in the model_configs of eval_sdar_hf_<Task_Name>.py with the actual path of your downloaded model.
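The edit looks roughly like the following fragment, assuming the model_configs entry is a dict with a path field as in typical OpenCompass Hugging Face configs (keep the file's other fields unchanged; the path shown is a placeholder):

```python
# Fragment of configs/eval_sdar_hf_<Task_Name>.py (illustrative) --
# edit only the path, leaving the file's other model settings as they are.
model_configs = [
    dict(
        # ... other model settings unchanged ...
        path="/absolute/path/to/SDAR-30B-A3B",  # <- your downloaded model
    ),
]
```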

2. Inference with expert-activation and decoding-order logging:

Replace modeling_sdar_moe.py in the downloaded model directory with the modeling_sdar_moe_mark.py provided in this repository.

cd evaluation/opencompass
CUDA_VISIBLE_DEVICES=<GPU_ID> python run.py configs/eval_sdar_hf_<Task_Name>_mark.py

Example:

CUDA_VISIBLE_DEVICES=0 python run.py configs/eval_sdar_hf_gsm8k_mark.py

Before inference, make sure to replace the model path in the model_configs of eval_sdar_hf_<Task_Name>_mark.py with the actual path of your downloaded model, and update the relevant paths in evaluation/opencompass/opencompass/openicl/icl_inferencer/icl_gen_inferencer.py as needed.

Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@article{wei2026team,
  title={TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration},
  author={Wei, Linye and Luo, Zixiang and Tang, Pingzhi and Li, Meng},
  journal={arXiv preprint arXiv:2602.08404},
  year={2026}
}

Acknowledgements

This repo is largely based on SDAR. We thank its authors for their excellent work and open-source contributions.

Contact

If you have any questions, please contact us via email at [email protected].
