
Conversation


@tohtana tohtana commented Sep 2, 2025

This PR includes these two fixes:

  • Use GradScaler only for FP16 (not for BF16); see the sketch after this list
  • Fix dtype conversion for ZeRO3 allgather
    • The reduce hook should be called only once, even when a parameter is shared across multiple layers (tied parameters).
    • Currently, the hook is triggered at each tied layer because we temporarily set .data with a different dtype.
    • The fix ensures that the parameter consistently retains the same dtype.
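
A minimal sketch of the first bullet, assuming plain PyTorch AMP rather than DeepSpeed's actual engine code: the gradient scaler is constructed as enabled only when the autocast dtype is FP16, since BF16 has enough dynamic range to train without loss scaling.

```python
import torch

# Hypothetical helper (not DeepSpeed's actual code): enable the GradScaler only
# when autocasting to FP16. With enabled=False, scale(), step() and update()
# become no-ops, so the BF16 path runs unscaled.
def make_grad_scaler(autocast_dtype: torch.dtype) -> torch.cuda.amp.GradScaler:
    return torch.cuda.amp.GradScaler(enabled=(autocast_dtype == torch.float16))

fp16_scaler = make_grad_scaler(torch.float16)    # loss scaling active
bf16_scaler = make_grad_scaler(torch.bfloat16)   # effectively a pass-through
```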

tohtana and others added 11 commits September 2, 2025 17:01
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
ZeRO3 tracks DDP (SPMD) behavior by matching the values of various training
states across ranks. Some of these states are represented as lists, and
mismatches sometimes manifest as hangs during error detection. This PR
improves error detection by first validating the list lengths across
ranks before validating the list contents.
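
A minimal sketch of the idea using `torch.distributed` collectives (the function name is illustrative, not the actual DeepSpeed code): ranks first agree on the list lengths, and only then compare the contents, so a length mismatch surfaces as a clear error rather than a hang.

```python
import torch.distributed as dist

# Illustrative only: validate list lengths across ranks before the contents.
def assert_list_matches_across_ranks(local_list, group=None):
    world_size = dist.get_world_size(group=group)

    # Step 1: gather and compare only the lengths.
    lengths = [None] * world_size
    dist.all_gather_object(lengths, len(local_list), group=group)
    if len(set(lengths)) != 1:
        raise RuntimeError(f"list length mismatch across ranks: {lengths}")

    # Step 2: lengths agree, so comparing the contents is now safe.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_list, group=group)
    if any(other != local_list for other in gathered):
        raise RuntimeError("list contents differ across ranks")
```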

Motivated by
#7461 (comment)

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
fix typos: s/1014/1024
           s/was_interruptted/was_interrupted

Details:
        modified:   deepspeed/autotuning/scheduler.py
        modified:   deepspeed/autotuning/utils.py

Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Replay #3019 as it got reverted.

Signed-off-by: Masahiro Tanaka <[email protected]>
This PR removes some startup noise and enables removing the rest -
especially messages that are replicated rank-times and don't carry any
informative payload.

1. add a `--log_level` flag which sets the launcher's logger to the desired
level - defaulting to `logging.INFO` for now for BC, but it will change
to `logging.WARNING` in v1
2. add a `--quiet/-q` flag which sets the launcher's logger to
`logging.ERROR`, which essentially disables startup info messages (see the
sketch after this list)
3. change the logging defaults elsewhere to `logging.WARNING` (the main
impact is `accelerator.py`). Once DeepSpeed has started, the frameworks
control the log level for each rank, so the tricky part is the info logs
from this pre-start stage. This part is BC-breaking as there is no machinery
to set the logger level for `real_accelerator.py`
4. change the builder to be non-verbose (BC breaking)
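
A minimal sketch of how such flags could map onto the standard `logging` module (the wiring and logger name are illustrative, not the launcher's actual code):

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--log_level", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                    help="launcher log level; INFO for BC, WARNING planned for v1")
parser.add_argument("-q", "--quiet", action="store_true",
                    help="suppress startup info messages (same as --log_level ERROR)")
args = parser.parse_args()

# --quiet takes precedence and effectively silences the startup info messages.
level = logging.ERROR if args.quiet else getattr(logging, args.log_level)
logging.getLogger("launcher").setLevel(level)
```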

---------

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
MoE tutorial fixes:
1. the cifar example has been moved - fix the url
2. fix the text and improve the markup

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
…x_lr` (#7530)

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Fixed a DeepSpeedCPULion bug with ZeRO-Offload
([issues/7524](#7524))

Signed-off-by: Qi Bin <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
@tohtana tohtana force-pushed the tohtana/fix_autocast_scaler branch from 97a71fc to f2f7780 on September 3, 2025 00:01
@tohtana tohtana requested a review from jomayeri as a code owner September 3, 2025 00:01
@tohtana tohtana enabled auto-merge (squash) September 3, 2025 01:07
@tohtana tohtana merged commit 1e183a6 into master Sep 3, 2025
12 checks passed
@tohtana tohtana deleted the tohtana/fix_autocast_scaler branch September 3, 2025 01:22
Flakes342 pushed a commit to Flakes342/DeepSpeed that referenced this pull request Sep 9, 2025
This PR includes these two fixes:
- Use GradScaler only for FP16 (not for BF16)
- Fix dtype conversion for ZeRO3 allgather
  - The reduce hook should be called only once, even when a parameter is shared across multiple layers (tied parameters).
  - Currently, the hook is triggered at each tied layer because we temporarily set `.data` with a different dtype.
  - The fix ensures that the parameter consistently retains the same dtype.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Signed-off-by: Qi Bin <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: digger yu <[email protected]>
Co-authored-by: Jake Hemmerle <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Qi Bin <[email protected]>
Signed-off-by: Flakes342 <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
This PR includes these two fixes:
- Use GradScaler only for FP16 (not for BF16)
- Fix dtype conversion for ZeRO3 allgather
  - The reduce hook should be called only once, even when a parameter is shared across multiple layers (tied parameters).
  - Currently, the hook is triggered at each tied layer because we temporarily set `.data` with a different dtype.
  - The fix ensures that the parameter consistently retains the same dtype.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Signed-off-by: Qi Bin <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: digger yu <[email protected]>
Co-authored-by: Jake Hemmerle <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Qi Bin <[email protected]>