Fix scaling and allgather with torch.autocast #7534
Merged
Conversation
Force-pushed from 5bdc312 to 5173baa
Signed-off-by: Masahiro Tanaka <[email protected]>
ZeRO3 tracks DDP (SPMD) behavior by matching the values of different training states across ranks. Some of these states are represented as lists, and mismatches sometimes manifest as hangs during error detection. This PR improves error detection by first validating the list lengths across ranks before validating the list contents. Motivated by #7461 (comment) --------- Signed-off-by: Olatunji Ruwase <[email protected]> Co-authored-by: Stas Bekman <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>
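A hedged sketch of the length-first validation idea (not DeepSpeed's actual code; `validate_list_states` and `local_states` are hypothetical names):

```python
import torch
import torch.distributed as dist

def validate_list_states(local_states: list) -> None:
    """Hypothetical helper: compare list lengths across ranks before contents."""
    world_size = dist.get_world_size()
    # Exchange the fixed-size length first: collectives over variable-size
    # payloads hang when ranks disagree on length, which is exactly the
    # failure mode this ordering avoids.
    length = torch.tensor([len(local_states)], dtype=torch.long, device="cuda")
    lengths = [torch.zeros_like(length) for _ in range(world_size)]
    dist.all_gather(lengths, length)
    if len({int(l.item()) for l in lengths}) > 1:
        raise RuntimeError(
            f"State list lengths differ across ranks: {[int(l) for l in lengths]}")
    # Only now is it safe to compare the list contents element-wise.
```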
fix typo s/1014 /1024
s/was_interruptted /was_interrupted
detail info
modified: deepspeed/autotuning/scheduler.py
modified: deepspeed/autotuning/utils.py
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Replay #3019 as it got reverted Signed-off-by: Masahiro Tanaka <[email protected]>
This PR removes some startup noise and enables removing more - especially messages that are replicated once per rank and carry no informative payload.
1. Add a `--log_level` flag that sets the launcher's logger to the desired level - defaulting to `logging.INFO` for now for backward compatibility, but changing to `logging.WARNING` in v1.
2. Add a `--quiet/-q` flag that sets the launcher's logger to `logging.ERROR`, which essentially disables startup info messages.
3. Change the logging defaults elsewhere to `logging.WARNING` (the main impact is `accelerator.py`). Once DeepSpeed has started, the frameworks control the log level for each rank, so the tricky part is the info logs from this pre-start stage. This breaks backward compatibility, as there is no machinery to set the logger level for `real_accelerator.py`.
4. The builder is changed to non-verbose (backward-compatibility breaking).
---------
Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
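As a rough illustration of how such flags could be wired up (a sketch, not the launcher's actual code; only the flag names and defaults come from the description above, and the logger name is an assumption):

```python
import argparse
import logging

parser = argparse.ArgumentParser()
# --log_level defaults to INFO for backward compatibility (WARNING planned for v1).
parser.add_argument("--log_level", default="INFO")
# --quiet/-q forces ERROR, silencing startup info messages entirely.
parser.add_argument("-q", "--quiet", action="store_true")
args = parser.parse_args()

level = logging.ERROR if args.quiet else getattr(logging, args.log_level.upper())
# "deepspeed.launcher" is a hypothetical logger name for illustration.
logging.getLogger("deepspeed.launcher").setLevel(level)
```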
MoE tutorial fixes:
1. The cifar example has been moved - fix the URL.
2. Fix text and improve markup.
---------
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
…x_lr` (#7530) Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: jakehemmerle <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>
Fixed a DeepSpeedCPULion bug with ZeRO-Offload [issues/7524](#7524) Signed-off-by: Qi Bin <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Force-pushed from 97a71fc to f2f7780
sfc-gh-truwase approved these changes on Sep 3, 2025
Flakes342 pushed a commit to Flakes342/DeepSpeed that referenced this pull request on Sep 9, 2025
This PR includes these two fixes:
- Use GradScaler only for FP16 (not for BF16)
- Fix dtype conversion for ZeRO3 allgather
  - The reduce hook should be called only once, even when a parameter is shared across multiple layers (tied parameters).
  - Currently, the hook is triggered at each tied layer because we temporarily set `.data` with a different dtype.
  - The fix ensures that the parameter consistently retains the same dtype.
---------
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Signed-off-by: Qi Bin <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: digger yu <[email protected]>
Co-authored-by: Jake Hemmerle <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Qi Bin <[email protected]>
Signed-off-by: Flakes342 <[email protected]>
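A minimal sketch of the first fix's intent, assuming the standard `torch.amp` API (an illustration, not the PR's actual code): BF16 shares FP32's exponent range, so gradients do not underflow and loss scaling is only needed for FP16.

```python
import torch

def make_grad_scaler(autocast_dtype: torch.dtype) -> torch.amp.GradScaler:
    # Enable loss scaling only for FP16; a disabled GradScaler passes
    # losses and gradients through unchanged, so the BF16 path is a no-op.
    return torch.amp.GradScaler("cuda", enabled=(autocast_dtype == torch.float16))
```

And a hedged sketch of the "reduce hook fires once" idea for tied parameters (`_already_reduced` and `make_reduce_hook` are hypothetical names, not DeepSpeed's implementation):

```python
def make_reduce_hook(param, reduce_fn):
    # A parameter shared by several layers (tied weights) has its hook
    # triggered once per layer; guard so the gradient is reduced only once.
    def hook(*_):
        if getattr(param, "_already_reduced", False):
            return
        param._already_reduced = True  # assumed to be cleared at step boundaries
        reduce_fn(param)
    return hook
```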
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request on Oct 4, 2025