
Conversation

@wukong1992
Contributor

Replace the MoE checkpoint's dp_world_size with seq_dp_world_size to support the MoE module with sequence parallelism.

@stas00
Collaborator

stas00 commented Dec 18, 2025

Why are you proposing to do that?

If you need seq_dp_world_size on resume and for some reason it's not there then probably store it additionally while leaving dp_world_size alone, no?

@wukong1992
Contributor Author

> Why are you proposing to do that?
>
> If you need seq_dp_world_size on resume and for some reason it's not there then probably store it additionally while leaving dp_world_size alone, no?

Because the non-MoE checkpoint save already uses this:

dp_world_size=self.seq_dp_world_size,

so there is no need to redefine seq_dp_world_size here.
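
For context, a minimal sketch of how the two save paths line up after this change (attribute names are assumptions, not the exact upstream code):

```python
# Sketch only: both save paths record the sequence-parallel-aware
# data-parallel world size in the checkpoint metadata.
def checkpoint_dp_world_size(engine) -> int:
    # _save_checkpoint already did this; _save_moe_checkpoint now matches it
    # (previously it presumably recorded the plain engine.dp_world_size).
    return engine.seq_dp_world_size
```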

@stas00
Collaborator

stas00 commented Dec 19, 2025

Thank you for explaining, then yes, it checks out.

@sfc-gh-truwase, I know you weren't involved in DS-MoE, but shouldn't _save_checkpoint and _save_moe_checkpoint follow the exact same recipe wrt saving the state_dict (the

state = dict(module=module,

part)? Only the model_state_dict differs. Perhaps abstract this last bit into a helper util and have both functions use it, with the only difference being the model_state_dict each one passes in? The MoE save path also stores num_experts, but it doesn't use it on load. I think it has diverged in other aspects too, since nobody bothered to keep the two in sync.
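
For illustration only, a minimal sketch of what such a shared helper could look like (the function name, signature, and field set are assumptions, not DeepSpeed's actual checkpoint schema):

```python
from typing import Any, Dict, Optional


def build_checkpoint_state(model_state_dict: Dict[str, Any],
                           dp_world_size: int,
                           optimizer_state: Optional[Dict[str, Any]] = None,
                           lr_scheduler_state: Optional[Dict[str, Any]] = None,
                           global_steps: int = 0,
                           **extra: Any) -> Dict[str, Any]:
    """Assemble the checkpoint payload shared by the dense and MoE save paths.

    Only model_state_dict (and any MoE-specific extras such as num_experts)
    differs between callers; everything else is built in one place so the two
    paths cannot drift apart again.
    """
    state = dict(
        module=model_state_dict,
        optimizer=optimizer_state,
        lr_scheduler=lr_scheduler_state,
        global_steps=global_steps,
        dp_world_size=dp_world_size,
    )
    state.update(extra)  # e.g. the MoE caller could pass num_experts=...
    return state
```

_save_checkpoint would then call it with the dense model state, and _save_moe_checkpoint with the expert state plus num_experts, so the dp_world_size / seq_dp_world_size decision lives in exactly one place.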

Collaborator

@stas00 left a comment


Thank you, @wukong1992

@stas00 merged commit 377a0d1 into deepspeedai:master Dec 19, 2025
11 checks passed