@sfc-gh-truwase
Collaborator

ZeRO3 tracks DDP (SPMD) behavior by matching the values of different training states across ranks. Some of these states are represented as lists, and mismatches sometimes manifest as hangs during error detection. This PR improves error detection by first validating the list lengths across ranks before validating the list contents.
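
The two-step ordering described above can be sketched in plain Python. This is only an illustration, not the code changed in this PR: `check_state_across_ranks` is a hypothetical helper, and the ranks are simulated in a single process by passing one list per rank. In a real distributed run the comparisons would go through collective operations instead.

```python
from typing import Sequence


def check_state_across_ranks(per_rank_state: Sequence[Sequence[int]]) -> None:
    """Hypothetical helper: per_rank_state[r] stands in for the list held by rank r."""
    if not per_rank_state:
        return

    # Step 1: compare lengths only. Every rank contributes a single integer,
    # so this comparison has the same shape regardless of what the lists hold.
    lengths = [len(state) for state in per_rank_state]
    if min(lengths) != max(lengths):
        raise RuntimeError(f"list length mismatch across ranks: {lengths}")

    # Step 2: lengths agree, so an element-wise comparison is well defined.
    reference = list(per_rank_state[0])
    for rank, state in enumerate(per_rank_state):
        if list(state) != reference:
            raise RuntimeError(
                f"list content mismatch on rank {rank}: {list(state)} != {reference}"
            )


# Matching state on both simulated ranks passes silently.
check_state_across_ranks([[1, 2, 3], [1, 2, 3]])
```

Checking lengths first means a rank holding a shorter or longer list is reported as a length mismatch before any element-wise comparison is attempted.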

Motivated by #7461 (comment)

Signed-off-by: Olatunji Ruwase <[email protected]>
Collaborator

@stas00 stas00 left a comment

Added some small suggestions, but looking good otherwise. Thank you, Tunji

@sfc-gh-truwase sfc-gh-truwase merged commit eabb687 into master Aug 31, 2025
12 checks passed
@sfc-gh-truwase sfc-gh-truwase deleted the sfc-gh-truwase/detect_z3_state_mismatch branch August 31, 2025 21:57
jakehemmerle pushed a commit to jakehemmerle/DeepSpeed that referenced this pull request Sep 1, 2025
Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
tohtana pushed a commit that referenced this pull request Sep 2, 2025
Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
tohtana pushed a commit that referenced this pull request Sep 3, 2025
Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Flakes342 pushed a commit to Flakes342/DeepSpeed that referenced this pull request Sep 9, 2025
Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: Flakes342 <[email protected]>
Flakes342 pushed a commit to Flakes342/DeepSpeed that referenced this pull request Sep 9, 2025
Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: Flakes342 <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
4 participants