Put Muon optimizer momentum buffer on GPU #7648
Conversation
Signed-off-by: Guokai Ma <[email protected]>
Hi @PKUWZP, I want to confirm this change with you. I saw comments saying the momentum buffer was put on CPU to save device memory, so I guess the intention is to allow training larger models with the Muon optimizer. But putting the momentum buffer on CPU also makes the Muon optimizer run slower. Maybe allowing the Muon optimizer with ZeRO offload would be the better way to handle large models.
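For context, here is a minimal sketch (not the actual DeepSpeed Muon code; names and the simplified momentum math are illustrative) of why a CPU-resident momentum buffer slows the step down: the buffer has to be copied to the device and back on every update.

```python
import torch

def momentum_step_cpu_buffer(p, momentum_buf_cpu, lr=1e-3, beta=0.9):
    """Hypothetical update with the momentum buffer kept on CPU.

    Every step pays an H2D copy (buffer to GPU), the update itself,
    and a D2H copy (updated buffer back to CPU).
    """
    buf = momentum_buf_cpu.to(p.device)       # H2D copy each step
    buf.mul_(beta).add_(p.grad)               # update on GPU
    p.data.add_(buf, alpha=-lr)
    momentum_buf_cpu.copy_(buf)               # D2H copy each step
```

Keeping the buffer on `p.device` removes both copies at the cost of extra device memory; for models that no longer fit, routing optimizer state through ZeRO offload (as suggested above) would be the alternative.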
Signed-off-by: Guokai Ma <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
Hi @PKUWZP, do you have comments for this PR? Thanks!
@delock Do you have any benchmarking results?
I tested by finetuning Qwen2.5-3B on 2xA100 cards with a global batch size of 8. On the master branch the finetune iteration time is 1430ms; with this PR it is 918ms. Profiling data shows that before this change a lot of time was spent on H2D and D2H copies. After this change, there are no H2D or D2H copies among the top profiled items.
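For reference, this is roughly how such copies show up with `torch.profiler` (a generic sketch, not the exact commands used for the numbers above; `train_step` is a placeholder): `Memcpy HtoD` / `Memcpy DtoH` entries near the top of the table indicate the buffer is bouncing between host and device every iteration.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def train_step():
    # Placeholder: run forward, backward, and optimizer.step() here.
    ...

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        train_step()

# With a CPU-resident momentum buffer, "Memcpy HtoD"/"Memcpy DtoH"
# tend to dominate this table; with the buffer on GPU they should
# drop out of the top entries.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```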
Hi @PKUWZP, any other comments? I'll merge this PR if you can approve it. Thanks!
Signed-off-by: Ma, Guokai <[email protected]>
This PR puts the Muon optimizer momentum buffer on GPU, which makes the Muon optimizer run much faster (finetuning Qwen2.5-3B on 2xA100 cards, iteration time 1500ms --> 910ms). Previously this buffer was on CPU.
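In code terms, the change amounts to allocating the momentum state on the parameter's device instead of on CPU. A hedged sketch of that pattern (function and state-dict key names are illustrative, not the actual DeepSpeed source):

```python
import torch

def get_momentum_buffer(state, p):
    """Lazily create the momentum buffer for parameter `p`.

    Before: torch.zeros_like(p, device="cpu")  -> forces H2D/D2H copies each step.
    After:  torch.zeros_like(p)                -> buffer lives on p.device (GPU).
    """
    if "momentum_buffer" not in state:
        state["momentum_buffer"] = torch.zeros_like(p)  # same device as p
    return state["momentum_buffer"]
```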