@zhengchenyu
Contributor

When the world size expands from 2 to 4, and the checkpoint is converted to a universal checkpoint and then loaded, the new ranks look for model files that do not exist.
A new rank, for example rank 3, tries to load the model file `zero_pp_rank_3_mp_rank_00_model_states.pt`, but the previous run with world size 2 never produced this file.
For stage 3, the fix is to load only the first file, `zero_pp_rank_0_mp_rank_00_model_states.pt`.
The existing unit test `TestZeROUniversalCheckpointDP::test_dp_world_size_2to4` reproduces this problem.
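
To make the file selection concrete, here is a minimal sketch of the idea described above. It is not the code from this PR: `model_states_filename` and `stage3_model_states_path` are hypothetical helpers, and the file name pattern is inferred only from the two file names quoted in the description.

```python
import os


def model_states_filename(dp_rank: int, mp_rank: int = 0) -> str:
    # Hypothetical helper: the pattern is inferred from the file names
    # quoted in the description, not taken from DeepSpeed's source.
    return f"zero_pp_rank_{dp_rank}_mp_rank_{mp_rank:02d}_model_states.pt"


def stage3_model_states_path(ckpt_dir: str, dp_rank: int, load_universal: bool) -> str:
    # Hypothetical helper illustrating the fix for ZeRO stage 3.
    # When loading a universal checkpoint, every rank reads the first
    # file (rank 0), because files for the new ranks were never written.
    # Otherwise each rank reads its own file.
    rank_to_load = 0 if load_universal else dp_rank
    return os.path.join(ckpt_dir, model_states_filename(rank_to_load))
```

With this sketch, `stage3_model_states_path("ckpt", 3, load_universal=True)` resolves to the rank 0 file, so rank 3 no longer depends on a file that the 2-rank run never wrote.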

@zhengchenyu zhengchenyu marked this pull request as draft September 28, 2025 06:49
@zhengchenyu zhengchenyu marked this pull request as ready for review September 28, 2025 12:16
@zhengchenyu zhengchenyu marked this pull request as draft September 29, 2025 02:52
@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) October 1, 2025 10:57
@sfc-gh-truwase sfc-gh-truwase merged commit 07e76bd into deepspeedai:master Oct 1, 2025
12 checks passed
@zhengchenyu zhengchenyu deleted the fix.load.universal branch October 2, 2025 02:22
delock pushed a commit that referenced this pull request Oct 3, 2025
… when world size expansion. (#7599)

When the world size expands from 2 to 4, and the checkpoint is converted
to a universal checkpoint and then loaded, the new ranks look for model
files that do not exist.
A new rank, for example rank 3, tries to load the model file
`zero_pp_rank_3_mp_rank_00_model_states.pt`, but the previous run with
world size 2 never produced this file.
For stage 3, the fix is to load only the first file,
`zero_pp_rank_0_mp_rank_00_model_states.pt`.
The existing unit test
`TestZeROUniversalCheckpointDP::test_dp_world_size_2to4` reproduces this
problem.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Guokai Ma <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
… when world size expansion. (deepspeedai#7599)
