Put Muon optimizer momentum buffer on GPU #7648
Conversation
Signed-off-by: Guokai Ma <[email protected]>
Hi @PKUWZP, I want to confirm this change with you. I saw comments saying the momentum buffer was put on CPU to save device memory, so I guess the intention is to allow training larger models with the Muon optimizer. But putting the momentum buffer on CPU also makes the Muon optimizer run slower. Maybe allowing the Muon optimizer with ZeRO offload would be the better way to handle large models.
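For context, here is a minimal sketch (not the actual DeepSpeed Muon code; names and the simplified momentum math are illustrative) of why a CPU-resident momentum buffer slows the step down: the buffer has to be copied to the device and back on every update.

```python
import torch

def momentum_step_cpu_buffer(p, momentum_buf_cpu, lr=1e-3, beta=0.9):
    """Hypothetical update with the momentum buffer kept on CPU.

    Every step pays an H2D copy (buffer to GPU), the update itself,
    and a D2H copy (updated buffer back to CPU).
    """
    buf = momentum_buf_cpu.to(p.device)       # H2D copy each step
    buf.mul_(beta).add_(p.grad)               # update on GPU
    p.data.add_(buf, alpha=-lr)
    momentum_buf_cpu.copy_(buf)               # D2H copy each step
```

Keeping the buffer on `p.device` removes both copies at the cost of extra device memory; for models that no longer fit, routing optimizer state through ZeRO offload (as suggested above) would be the alternative.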
Signed-off-by: Guokai Ma <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
Hi @PKUWZP, do you have comments for this PR? Thanks!
@delock Do you have any benchmarking results?
I tested by finetuning Qwen2.5-3B on 2xA100 cards with a global batch size of 8. On the master branch the finetune iteration time is 1430ms; with this PR it is 918ms. Profiling data shows that before this change a lot of time was spent on H2D and D2H copies. After this change, there are no H2D or D2H copies among the top profiled items.
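For reference, this is roughly how such copies show up with `torch.profiler` (a generic sketch, not the exact commands used for the numbers above; `train_step` is a placeholder): `Memcpy HtoD` / `Memcpy DtoH` entries near the top of the table indicate the buffer is bouncing between host and device every iteration.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def train_step():
    # Placeholder: run forward, backward, and optimizer.step() here.
    ...

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        train_step()

# With a CPU-resident momentum buffer, "Memcpy HtoD"/"Memcpy DtoH"
# tend to dominate this table; with the buffer on GPU they should
# drop out of the top entries.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```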
Hi @PKUWZP, any other comments? I'll merge this PR if you can approve it. Thanks!
Signed-off-by: Ma, Guokai <[email protected]>
This PR puts the Muon optimizer momentum buffer on GPU, which makes the Muon optimizer run much faster (finetuning Qwen2.5-3B on 2xA100 cards, iteration time 1500ms --> 910ms). Previously this buffer was on CPU.
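In code terms, the change amounts to allocating the momentum state on the parameter's device instead of on CPU. A hedged sketch of that pattern (function and state-dict key names are illustrative, not the actual DeepSpeed source):

```python
import torch

def get_momentum_buffer(state, p):
    """Lazily create the momentum buffer for parameter `p`.

    Before: torch.zeros_like(p, device="cpu")  -> forces H2D/D2H copies each step.
    After:  torch.zeros_like(p)                -> buffer lives on p.device (GPU).
    """
    if "momentum_buffer" not in state:
        state["momentum_buffer"] = torch.zeros_like(p)  # same device as p
    return state["momentum_buffer"]
```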