stas00 (Collaborator) commented on Oct 25, 2025

It looks like save_checkpoint expects the get_model_parallel_* API on the mpu object, so this PR adds it to the slim Ulysses mpu variant.

This fixes the following failure in the HF Trainer:

```
[rank1]:   File "/code/users/stas/github/transformers-alst-integration/src/transformers/trainer.py", line 3248, in _save_optimizer_and_scheduler
[rank1]:     self.model_wrapped.save_checkpoint(output_dir)
[rank1]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/engine.py", line 3497, in save_checkpoint
[rank1]:     self._save_checkpoint(save_dir,
[rank1]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/engine.py", line 3709, in _save_checkpoint
[rank1]:     save_path = self._get_ckpt_name(save_dir, tag)
[rank1]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/code/users/stas/github/DeepSpeed/deepspeed/runtime/engine.py", line 3039, in _get_ckpt_name
[rank1]:     mp_rank = 0 if self.mpu is None else self.mpu.get_model_parallel_rank()
[rank1]:                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: AttributeError: module 'deepspeed.runtime.sequence_parallel.parallel_state_sp' has no attribute 'get_model_parallel_rank'. Did you mean: 'get_sequence_parallel_rank'?
```
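
For context, here is a minimal sketch of the kind of shim this adds to the slim Ulysses mpu module. It is an illustration, not the exact diff: Ulysses shards along the sequence dimension rather than the model weights, so the model-parallel view is degenerate, and the function bodies (and the inclusion of get_model_parallel_world_size) are assumptions.

```python
# Sketch only: hypothetical additions to a slim Ulysses mpu module such as
# deepspeed.runtime.sequence_parallel.parallel_state_sp. Ulysses parallelizes
# the sequence dimension, not the model weights, so model parallelism is
# degenerate here (assumption).

def get_model_parallel_rank() -> int:
    # The engine calls mpu.get_model_parallel_rank() in _get_ckpt_name();
    # with no model parallelism, every rank is rank 0 of its own group.
    return 0

def get_model_parallel_world_size() -> int:
    # Assumption: each model-parallel group is a singleton.
    return 1
```

With get_model_parallel_rank() returning 0, the _get_ckpt_name line from the traceback (mp_rank = 0 if self.mpu is None else self.mpu.get_model_parallel_rank()) resolves mp_rank to 0, matching the non-model-parallel checkpoint layout.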

Signed-off-by: Stas Bekman <[email protected]>
stas00 merged commit 433e3c7 into master on Oct 28, 2025 (12 checks passed).
stas00 deleted the stas/ulysses-mpu branch on Oct 28, 2025 at 03:43.
stas00 (Collaborator, Author) commented on Oct 28, 2025

Thank you, Tunji!
