Conversation

@zhengchenyu (Contributor)

In a multi-machine environment, loading a ZeRO stage 3 universal checkpoint produces incorrect results, causing the loss to increase abnormally.

@zhengchenyu (Contributor, Author)

It seems that the unit tests currently only run in a single-machine environment, so I verified the fix with my own example. Env: 2 nodes, 8 GPUs per node. There are three experiments (a sketch of the save/convert/reload flow follows the results):

  • base: the baseline run, with no interruption and no checkpointing.
  • bug: the current master branch. Checkpoint at step 20, convert to a universal checkpoint, then load it. The loss increases abnormally.
  • fix: this PR. Also checkpoint at step 20, convert to a universal checkpoint, then reload it. The loss stays normal.
[Screenshot 2025-09-28 19:37:40: loss curves comparing the base, bug, and fix runs]
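
For context, here is a minimal sketch of the save → convert → reload flow being tested. The toy model, folder names, and checkpoint tags are assumptions for illustration; the `ds_to_universal` conversion script and the `"checkpoint": {"load_universal": true}` config key are the usual DeepSpeed mechanisms, but verify them against your DeepSpeed version.

```python
# Minimal sketch of the checkpoint round-trip, assuming a toy model.
# The real experiments use the author's own multi-node training example.
import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # hypothetical stand-in model
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 3},
    # Needed when resuming from a universal checkpoint:
    "checkpoint": {"load_universal": True},
}
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# 1. Train to step 20, then save a regular ZeRO stage-3 checkpoint.
engine.save_checkpoint("ckpt", tag="global_step20")

# 2. Convert it offline to a universal checkpoint (run as a shell command):
#    python -m deepspeed.checkpoint.ds_to_universal \
#        --input_folder  ckpt/global_step20 \
#        --output_folder ckpt/global_step20_universal

# 3. Relaunch on all nodes and resume from the converted checkpoint,
#    then compare the loss curve against the base run.
engine.load_checkpoint("ckpt", tag="global_step20_universal")
```

In the bug run, step 3 is where the loss diverges on multi-node setups; in the fix run it matches the base curve.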

@tohtana (Collaborator) left a comment


Great catch! Thank you for the fix!

@tohtana tohtana enabled auto-merge (squash) September 28, 2025 17:17
@tohtana tohtana merged commit 47b3fb5 into deepspeedai:master Sep 28, 2025
13 of 14 checks passed
@zhengchenyu zhengchenyu deleted the fix.load.universal.multi.nodes branch September 29, 2025 01:39
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
…ine mode. (deepspeedai#7601)

In a multi-machine environment, loading the stage3 universal checkpoint
will produce incorrect results, causing the loss to increase abnormally.