
Conversation

PKUWZP (Collaborator) commented Aug 23, 2025

Authorship: @pengdurice and @PKUWZP

Related Issue: #7438

Introduction

[Muon](https://arxiv.org/abs/2502.16982), a new optimizer that has recently attracted the community’s attention, shows promising results in training large language models. Adding the Muon optimizer to DeepSpeed, a popular OSS framework for large-scale training and inference, is critically important for DeepSpeed users and developers. There has been a [PR](#7454) attempting the adoption (huge thanks to @qimcis), which is a good starting point, but substantial additional effort is still required to make it fully compatible with and functional within DeepSpeed. We are publishing this PR to fully enable the Muon optimizer’s capabilities in DeepSpeed.

Issues and solutions

Issues

  1. With ZeRO stage 1, 2, or 3, the optimizer states are partitioned within the same data-parallel group. Each process therefore already handles only a slice of the model parameters, so there is no need for the DP-based partitioning used in the reference [code](https://github.com/KellerJordan/Muon/blob/master/muon.py#L195).
  2. The parameters (and gradients) are flattened into a 1-D vector before being passed to the optimizer, which breaks the core assumption of Muon: it works by orthogonalizing the update of each matrix-shaped parameter (dim >= 2). A minimal sketch of that orthogonalization follows this list.
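
For context, the orthogonalization Muon relies on is typically implemented with a quintic Newton-Schulz iteration. Below is a minimal sketch adapted from the public reference implementation; the coefficients follow that code, but treat this as an illustration rather than the exact kernel shipped in this PR.

```python
import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D update G via a quintic Newton-Schulz iteration."""
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315    # quintic coefficients from the reference Muon code
    X = G.float()                         # the reference implementation runs this in bfloat16 for speed
    transposed = G.size(0) > G.size(1)
    if transposed:                        # iterate on the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)             # bring the norm down so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```

Issue 2 above is precisely that once ZeRO has flattened every parameter and gradient to 1-D, there is no matrix left to feed into a routine like this.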

Solutions

To solve these issues, this PR makes the following changes:

  1. We simplify the Muon code by removing the partitioning and Muon update logic.

  2. We move the Muon update into the get_flat_partition function of the stage 1 and 2 DeepSpeedZeroOptimizer, where per-parameter gradients are collected before being flattened and consumed by the optimizer to update the model parameters. Since each parameter is still in its original shape at that point, we can apply the Muon update directly (see the sketch after this list).

  3. We also save the momentum buffer into the optimizer’s state so that training resumes smoothly from saved checkpoints.

  4. We add comprehensive unit tests to validate the Muon optimizer’s correctness and functionality.
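
To make item 2 concrete, here is an illustrative per-parameter Muon-style step of the kind that can be applied inside get_flat_partition while the gradient still has its original shape. This is not the code merged in this PR; the state-dict layout, hyperparameter names, and defaults are placeholders, and it reuses the zeropower_via_newtonschulz5 sketch from the Issues section.

```python
import torch

def muon_like_update(param: torch.Tensor, grad: torch.Tensor, state: dict,
                     lr: float = 0.02, momentum: float = 0.95) -> None:
    """Illustrative Muon-style step on an unflattened parameter (placeholder names)."""
    # Momentum buffer kept in optimizer state so checkpoint resume stays smooth (item 3).
    buf = state.setdefault("momentum_buffer", torch.zeros_like(grad))
    buf.mul_(momentum).add_(grad)                   # momentum accumulation
    update = grad.add(buf, alpha=momentum)          # Nesterov-style lookahead
    if update.ndim >= 2:                            # Muon path: orthogonalize matrix-shaped updates
        flat2d = update.reshape(update.shape[0], -1)
        # zeropower_via_newtonschulz5: see the Newton-Schulz sketch in the Issues section.
        update = zeropower_via_newtonschulz5(flat2d).reshape_as(grad)
        # The reference implementation also rescales the orthogonalized update by a
        # shape-dependent factor; omitted here for brevity.
    # 1-D parameters (biases, norms) would normally go through the auxiliary Adam path instead.
    with torch.no_grad():                           # optimizer step: no autograd tracking
        param.add_(update, alpha=-lr)
```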

Future directions and roadmap

In the future, several follow-up items are of interest:

  • Create a CPU-offload version.
  • Apply Muon to ZeRO stage 3.
  • Use the highly optimized Adam implementation for the Adam portion of the MuonWithAuxAdam optimizer.
  • More efficient implementations, e.g. (a) specialized kernels for the Newton-Schulz iteration and Muon updates; (b) parallelized updates across parameters (currently each parameter is updated separately and sequentially).

PKUWZP merged commit 66ad278 into deepspeedai:master on Aug 27, 2025
12 checks passed
delock pushed a commit that referenced this pull request Sep 1, 2025
tohtana added a commit that referenced this pull request Sep 10, 2025
The initialization of DeepCompile+Z1/2 now fails due to the change
introduced in #7509.

This PR resolves the issue by:
- Adding an argument to optimizer.get_flat_partition
- Skipping the entire allreduce function in the engine

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
delock (Collaborator) commented Sep 16, 2025

@PKUWZP is MuonClip (used by Kimi K2, https://arxiv.org/abs/2507.20534) also of future interest? According to the paper, clipping is essential to avoid exploding attention logits when the model is very large.
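
For readers unfamiliar with MuonClip: the qk-clip mechanism described in the Kimi K2 report rescales the query/key projection weights whenever the largest observed attention logit exceeds a threshold, so future logits shrink. The sketch below only illustrates that idea; the tensor layout, per-head bookkeeping, and default values are assumptions, and nothing like this is part of this PR.

```python
import torch

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float,
             tau: float = 100.0, alpha: float = 0.5) -> None:
    """Hedged sketch of qk-clip: shrink Q/K projections when attention logits explode."""
    if max_logit > tau:
        gamma = tau / max_logit              # how far the worst logit exceeds the threshold
        with torch.no_grad():
            w_q.mul_(gamma ** alpha)         # split the correction between the two projections
            w_k.mul_(gamma ** (1.0 - alpha))
```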

sfc-gh-truwase added a commit that referenced this pull request Sep 17, 2025
The original Muon optimizer PR (#7509) requires the user to explicitly set the `use_muon` flag on `model.parameters()`, as shown in the test
https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/ops/muon/test_muon.py#L27.

This PR integrates the setting of `use_muon` into DeepSpeed before engine initialization, which makes the Muon optimizer easier to use: the user only needs to change the optimizer in `config.json` from `AdamW` to `Muon`, with no code changes. It also resolves issue #7552.
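
For illustration, switching to Muon then looks roughly like the sketch below. The exact keys and defaults accepted by the Muon optimizer config are not spelled out in this thread, so the `params` values here are placeholders; consult the DeepSpeed documentation for the authoritative schema.

```python
import torch
import deepspeed

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8))

# Minimal sketch: the only change versus an AdamW run is the optimizer "type".
ds_config = {
    "train_batch_size": 32,
    "optimizer": {
        "type": "Muon",                              # previously "AdamW"
        "params": {"lr": 2e-2, "momentum": 0.95},    # placeholder hyperparameters
    },
    "zero_optimization": {"stage": 2},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```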

---------

Signed-off-by: Ma, Guokai <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
harshitkd commented Nov 13, 2025

How can I use Muon for stage 3? As I see it, it has not yet been applied to stage 3.

pengdurice (Collaborator) replied:

> How can I use Muon for stage 3? As I see it, it has not yet been applied to stage 3.

Hi, the team is working on enabling Muon for stage 3. Stay tuned. Thanks!
