Conversation

@zhengchenyu (Contributor)

In a multi-machine environment, loading a ZeRO stage 3 universal checkpoint produces incorrect results, causing the loss to increase abnormally.

@zhengchenyu (Contributor, Author)

It seems that the unit tests currently only run in a single-machine environment, so I verified the fix with my own example. Env: 2 nodes, 8 GPUs per node. There are three experiments (a sketch of the save/convert/reload flow follows the results):

  • base: the baseline run, with no interruption and no checkpointing.
  • bug: the current master branch. Checkpoint at step 20, convert to a universal checkpoint, then load it. The loss increases abnormally.
  • fix: this PR. Also checkpoint at step 20, convert to a universal checkpoint, then reload it. The loss stays normal.
[Screenshot 2025-09-28 19:37:40: loss curves comparing the base, bug, and fix runs]
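
For context, here is a minimal sketch of the save → convert → reload flow being tested. The toy model, folder names, and checkpoint tags are assumptions for illustration; the `ds_to_universal` conversion script and the `"checkpoint": {"load_universal": true}` config key are the usual DeepSpeed mechanisms, but verify them against your DeepSpeed version.

```python
# Minimal sketch of the checkpoint round-trip, assuming a toy model.
# The real experiments use the author's own multi-node training example.
import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # hypothetical stand-in model
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 3},
    # Needed when resuming from a universal checkpoint:
    "checkpoint": {"load_universal": True},
}
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# 1. Train to step 20, then save a regular ZeRO stage-3 checkpoint.
engine.save_checkpoint("ckpt", tag="global_step20")

# 2. Convert it offline to a universal checkpoint (run as a shell command):
#    python -m deepspeed.checkpoint.ds_to_universal \
#        --input_folder  ckpt/global_step20 \
#        --output_folder ckpt/global_step20_universal

# 3. Relaunch on all nodes and resume from the converted checkpoint,
#    then compare the loss curve against the base run.
engine.load_checkpoint("ckpt", tag="global_step20_universal")
```

In the bug run, step 3 is where the loss diverges on multi-node setups; in the fix run it matches the base curve.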

@tohtana (Collaborator) left a comment


Great catch! Thank you for the fix!

@tohtana tohtana enabled auto-merge (squash) September 28, 2025 17:17
@tohtana tohtana merged commit 47b3fb5 into deepspeedai:master Sep 28, 2025
13 of 14 checks passed
@zhengchenyu zhengchenyu deleted the fix.load.universal.multi.nodes branch September 29, 2025 01:39
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
…ine mode. (deepspeedai#7601)

In a multi-machine environment, loading the stage3 universal checkpoint
will produce incorrect results, causing the loss to increase abnormally.