
Conversation

@wukong1992
Contributor

Replace the MoE checkpoint's dp_world_size with seq_dp_world_size to support the MoE module with sequence parallelism.

@stas00
Collaborator

stas00 commented Dec 18, 2025

Why are you proposing to do that?

If you need seq_dp_world_size on resume and for some reason it's not there then probably store it additionally while leaving dp_world_size alone, no?

@wukong1992
Contributor Author

> Why are you proposing to do that?
>
> If you need seq_dp_world_size on resume and for some reason it's not there then probably store it additionally while leaving dp_world_size alone, no?

Because the non-MoE checkpoint save already uses this:

dp_world_size=self.seq_dp_world_size,

so there is no need to redefine seq_dp_world_size here.
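
For context, a minimal sketch of how the two save paths line up after this change (attribute names are assumptions, not the exact upstream code):

```python
# Sketch only: both save paths record the sequence-parallel-aware
# data-parallel world size in the checkpoint metadata.
def checkpoint_dp_world_size(engine) -> int:
    # _save_checkpoint already did this; _save_moe_checkpoint now matches it
    # (previously it presumably recorded the plain engine.dp_world_size).
    return engine.seq_dp_world_size
```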

@stas00
Collaborator

stas00 commented Dec 19, 2025

Thank you for explaining, then yes, it checks out.

@sfc-gh-truwase, I know you weren't involved in DS-MoE, but shouldn't _save_checkpoint and _save_moe_checkpoint follow the exact same recipe wrt saving the state_dict (the

state = dict(module=module,

part)? Only the model_state_dict differs. Perhaps abstract this last bit into a helper util and have both functions use it, with the only difference being the model_state_dict each one passes in? The MoE save path also stores num_experts, but it doesn't use it on load. I think it has diverged in other aspects too, since nobody bothered to keep the two in sync.
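
For illustration only, a minimal sketch of what such a shared helper could look like (the function name, signature, and field set are assumptions, not DeepSpeed's actual checkpoint schema):

```python
from typing import Any, Dict, Optional


def build_checkpoint_state(model_state_dict: Dict[str, Any],
                           dp_world_size: int,
                           optimizer_state: Optional[Dict[str, Any]] = None,
                           lr_scheduler_state: Optional[Dict[str, Any]] = None,
                           global_steps: int = 0,
                           **extra: Any) -> Dict[str, Any]:
    """Assemble the checkpoint payload shared by the dense and MoE save paths.

    Only model_state_dict (and any MoE-specific extras such as num_experts)
    differs between callers; everything else is built in one place so the two
    paths cannot drift apart again.
    """
    state = dict(
        module=model_state_dict,
        optimizer=optimizer_state,
        lr_scheduler=lr_scheduler_state,
        global_steps=global_steps,
        dp_world_size=dp_world_size,
    )
    state.update(extra)  # e.g. the MoE caller could pass num_experts=...
    return state
```

_save_checkpoint would then call it with the dense model state, and _save_moe_checkpoint with the expert state plus num_experts, so the dp_world_size / seq_dp_world_size decision lives in exactly one place.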

Collaborator

@stas00 left a comment


Thank you, @wukong1992

@stas00 merged commit 377a0d1 into deepspeedai:master Dec 19, 2025
11 checks passed