@jakehemmerle
Contributor

No description provided.

sfc-gh-truwase and others added 2 commits August 31, 2025 20:18
ZeRO3 tracks DDP (SPMD) behavior by matching the values of different
training states across ranks. Some of these states are represented as
lists, and mismatches sometimes manifest as hangs during error
detection. This PR improves error detection by first validating the
list lengths across ranks before validating the list contents.
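
The length-then-contents ordering described above can be sketched as follows. This is a single-process simulation, not the DeepSpeed implementation: each rank's list is passed in as a plain Python sequence rather than exchanged through collectives, and the names `validate_state_across_ranks` and `RankMismatchError` are hypothetical, introduced only for illustration.

```python
from typing import Sequence


class RankMismatchError(RuntimeError):
    """Hypothetical error type: ranks disagree on a tracked training state."""


def validate_state_across_ranks(per_rank: Sequence[Sequence[int]]) -> None:
    """Compare every rank's list against rank 0: lengths first, then contents.

    `per_rank[i]` stands in for the list held by rank i. In a real
    distributed job the values would be exchanged with collectives; here
    they are already gathered so the ordering of the checks is visible.
    """
    reference = per_rank[0]

    # Step 1: compare lengths only. A length is a single integer, so this
    # check has the same shape on every rank regardless of list size.
    for rank, values in enumerate(per_rank):
        if len(values) != len(reference):
            raise RankMismatchError(
                f"length mismatch: rank 0 has {len(reference)} items, "
                f"rank {rank} has {len(values)}"
            )

    # Step 2: lengths agree, so an element-wise comparison is well defined.
    for rank, values in enumerate(per_rank):
        for index, (expected, actual) in enumerate(zip(reference, values)):
            if expected != actual:
                raise RankMismatchError(
                    f"content mismatch at index {index}: rank 0 has "
                    f"{expected}, rank {rank} has {actual}"
                )
```

Checking lengths first means a rank holding a shorter or longer list is reported by a clear error before any element-wise comparison is attempted.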

Motivated by
deepspeedai#7461 (comment)

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Corrected the parameter name from 'cycle_min_lr' to 'cycle_max_lr' in the tutorial.

Signed-off-by: jakehemmerle <[email protected]>
@loadams loadams enabled auto-merge (squash) September 2, 2025 21:16
@loadams loadams merged commit 4d83f3f into deepspeedai:master Sep 2, 2025
2 checks passed
tohtana pushed a commit that referenced this pull request Sep 3, 2025
…x_lr` (#7530)

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Flakes342 pushed a commit to Flakes342/DeepSpeed that referenced this pull request Sep 9, 2025
…x_lr` (deepspeedai#7530)

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Flakes342 <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
…x_lr` (deepspeedai#7530)

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>