ZeRO3: Improve mismatch detection #7525

sfc-gh-truwase · 2025-08-29T13:44:41Z

ZeRO3 tracks DDP (SPMD) behavior by matching the values of different training states across ranks. Some of these states are represented as lists, and mismatches sometimes manifest as hangs during error detection. This PR improves error detection by first validating the list lengths across ranks before validating the list contents.

Motivated by #7461 (comment)

Signed-off-by: Olatunji Ruwase <[email protected]>
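
For readers unfamiliar with the idea, here is a minimal single-process sketch of the "lengths first, contents second" ordering described above. It is not the code from this PR: the function name `check_lists_match` and the `per_rank` dictionary are hypothetical, and all per-rank lists are collected in one place purely for illustration. In a real job each rank only holds its own list, so the comparison would go through collectives.

```python
from typing import Dict, Sequence


def check_lists_match(name: str, per_rank: Dict[int, Sequence[int]]) -> None:
    """Raise ValueError if the lists held by different ranks disagree.

    Hypothetical helper: `per_rank` maps a rank id to that rank's local list.
    """
    if not per_rank:
        return

    # Step 1: compare lengths only. A length is one integer per rank, so
    # this comparison has the same shape everywhere regardless of contents.
    lengths = {rank: len(values) for rank, values in per_rank.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"{name}: list length mismatch across ranks: {lengths}")

    # Step 2: lengths agree, so an element-wise comparison is well defined.
    reference_rank = min(per_rank)
    reference = list(per_rank[reference_rank])
    for rank, values in sorted(per_rank.items()):
        for index, (expected, actual) in enumerate(zip(reference, values)):
            if expected != actual:
                raise ValueError(
                    f"{name}: value mismatch at index {index}: "
                    f"rank {reference_rank} has {expected}, rank {rank} has {actual}"
                )
```

Checking lengths first means a length mismatch is reported as an explicit error instead of surfacing later, during the content comparison, as a hang.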

deepspeed/runtime/zero/config.py

deepspeed/runtime/zero/utils.py

Signed-off-by: Olatunji Ruwase <[email protected]>

stas00

Added some small suggestions, but looking good otherwise. Thank you, Tunji

deepspeed/runtime/engine.py

deepspeed/runtime/zero/utils.py

Co-authored-by: Stas Bekman <[email protected]>

Signed-off-by: Olatunji Ruwase <[email protected]>


Detect list len mismatches

641d86f

Signed-off-by: Olatunji Ruwase <[email protected]>

sfc-gh-truwase requested a review from stas00 August 29, 2025 13:44

sfc-gh-truwase requested review from tjruwase and tohtana as code owners August 29, 2025 13:44

sfc-gh-truwase added 3 commits August 29, 2025 13:46

Revert

b0b6bd6

Signed-off-by: Olatunji Ruwase <[email protected]>

Z3 sanity check option

cbf3d66

Revert

ceef875

Signed-off-by: Olatunji Ruwase <[email protected]>

tohtana reviewed Aug 29, 2025

deepspeed/runtime/zero/config.py (outdated, resolved)

tohtana reviewed Aug 29, 2025

deepspeed/runtime/zero/utils.py (outdated, resolved)

sfc-gh-truwase added 3 commits August 29, 2025 16:19

Minor tweaks

1a11c18

Signed-off-by: Olatunji Ruwase <[email protected]>

Improve error message format

ffdccf2

Signed-off-by: Olatunji Ruwase <[email protected]>

Improve error message format

498e69c

Signed-off-by: Olatunji Ruwase <[email protected]>

stas00 approved these changes Aug 29, 2025

deepspeed/runtime/engine.py (outdated, resolved)

deepspeed/runtime/zero/utils.py (outdated, resolved)

deepspeed/runtime/zero/utils.py (outdated, resolved)

sfc-gh-truwase and others added 5 commits August 29, 2025 16:55

Update deepspeed/runtime/zero/utils.py

d6b3b74

Co-authored-by: Stas Bekman <[email protected]>

Update deepspeed/runtime/engine.py

5aad574

Co-authored-by: Stas Bekman <[email protected]>

PR feedback

0b6145f

Signed-off-by: Olatunji Ruwase <[email protected]>

Add list length

948f777

Signed-off-by: Olatunji Ruwase <[email protected]>

Merge branch 'master' into sfc-gh-truwase/detect_z3_state_mismatch

05f1e97

sfc-gh-truwase merged commit eabb687 into master Aug 31, 2025
12 checks passed

sfc-gh-truwase deleted the sfc-gh-truwase/detect_z3_state_mismatch branch August 31, 2025 21:57


4 participants
