
Conversation


@tohtana tohtana commented Sep 2, 2025

This PR includes these two fixes:

  • Use GradScaler only for FP16 (not for BF16); see the sketch after this list
  • Fix dtype conversion for ZeRO3 allgather
    • The reduce hook should be called only once, even when a parameter is shared across multiple layers (tied parameters).
    • Currently, the hook is triggered at each tied layer because we temporarily set .data with a different dtype.
    • The fix ensures that the parameter consistently retains the same dtype.
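
A minimal sketch of the first bullet, assuming plain PyTorch AMP rather than DeepSpeed's actual engine code: the gradient scaler is constructed as enabled only when the autocast dtype is FP16, since BF16 has enough dynamic range to train without loss scaling.

```python
import torch

# Hypothetical helper (not DeepSpeed's actual code): enable the GradScaler only
# when autocasting to FP16. With enabled=False, scale(), step() and update()
# become no-ops, so the BF16 path runs unscaled.
def make_grad_scaler(autocast_dtype: torch.dtype) -> torch.cuda.amp.GradScaler:
    return torch.cuda.amp.GradScaler(enabled=(autocast_dtype == torch.float16))

fp16_scaler = make_grad_scaler(torch.float16)    # loss scaling active
bf16_scaler = make_grad_scaler(torch.bfloat16)   # effectively a pass-through
```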

tohtana and others added 11 commits September 2, 2025 17:01
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
ZeRO3 tracks DDP (SPMD) behavior by matching the values of various training
states across ranks. Some of these states are represented as lists, and
mismatches sometimes manifest as hangs during error detection. This PR
improves error detection by first validating the list lengths across
ranks before validating the list contents.
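
A minimal sketch of the idea using `torch.distributed` collectives (the function name is illustrative, not the actual DeepSpeed code): ranks first agree on the list lengths, and only then compare the contents, so a length mismatch surfaces as a clear error rather than a hang.

```python
import torch.distributed as dist

# Illustrative only: validate list lengths across ranks before the contents.
def assert_list_matches_across_ranks(local_list, group=None):
    world_size = dist.get_world_size(group=group)

    # Step 1: gather and compare only the lengths.
    lengths = [None] * world_size
    dist.all_gather_object(lengths, len(local_list), group=group)
    if len(set(lengths)) != 1:
        raise RuntimeError(f"list length mismatch across ranks: {lengths}")

    # Step 2: lengths agree, so comparing the contents is now safe.
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_list, group=group)
    if any(other != local_list for other in gathered):
        raise RuntimeError("list contents differ across ranks")
```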

Motivated by
#7461 (comment)

---------

Signed-off-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
fix typos: s/1014/1024
           s/was_interruptted/was_interrupted

Details:
        modified:   deepspeed/autotuning/scheduler.py
        modified:   deepspeed/autotuning/utils.py

Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Replay #3019 as it got reverted.

Signed-off-by: Masahiro Tanaka <[email protected]>
This PR removes some startup noise and enables removing the rest -
especially messages that are replicated rank-times and don't carry any
informative payload.

1. add a `--log_level` flag which sets the launcher's logger to the desired
level - defaulting to `logging.INFO` for now for BC, but it will change
to `logging.WARNING` in v1
2. add a `--quiet/-q` flag which sets the launcher's logger to
`logging.ERROR`, which essentially disables startup info messages (see the
sketch after this list)
3. change the logging defaults elsewhere to `logging.WARNING` (the main
impact is `accelerator.py`). Once DeepSpeed has started, the frameworks
control the log level for each rank, so the tricky part is the info logs
from this pre-start stage. This part is BC-breaking as there is no machinery
to set the logger level for `real_accelerator.py`
4. change the builder to be non-verbose (BC breaking)
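
A minimal sketch of how such flags could map onto the standard `logging` module (the wiring and logger name are illustrative, not the launcher's actual code):

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--log_level", default="INFO",
                    choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
                    help="launcher log level; INFO for BC, WARNING planned for v1")
parser.add_argument("-q", "--quiet", action="store_true",
                    help="suppress startup info messages (same as --log_level ERROR)")
args = parser.parse_args()

# --quiet takes precedence and effectively silences the startup info messages.
level = logging.ERROR if args.quiet else getattr(logging, args.log_level)
logging.getLogger("launcher").setLevel(level)
```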

---------

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
MoE tutorial fixes:
1. the cifar example has been moved - fix the url
2. fix the text and improve the markup

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
…x_lr` (#7530)

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Fixed a DeepSpeedCPULion bug with ZeRO-Offload
([issues/7524](#7524))

Signed-off-by: Qi Bin <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Masahiro Tanaka <[email protected]>
@tohtana tohtana force-pushed the tohtana/fix_autocast_scaler branch from 97a71fc to f2f7780 on September 3, 2025 00:01
@tohtana tohtana requested a review from jomayeri as a code owner September 3, 2025 00:01
@tohtana tohtana enabled auto-merge (squash) September 3, 2025 01:07
@tohtana tohtana merged commit 1e183a6 into master Sep 3, 2025
12 checks passed
@tohtana tohtana deleted the tohtana/fix_autocast_scaler branch September 3, 2025 01:22
Flakes342 pushed a commit to Flakes342/DeepSpeed that referenced this pull request Sep 9, 2025
This PR includes these two fixes:
- Use GradScaler only for FP16 (not for BF16)
- Fix dtype conversion for ZeRO3 allgather
  - The reduce hook should be called only once, even when a parameter is shared across multiple layers (tied parameters).
  - Currently, the hook is triggered at each tied layer because we temporarily set `.data` with a different dtype.
  - The fix ensures that the parameter consistently retains the same dtype.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Signed-off-by: Qi Bin <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: digger yu <[email protected]>
Co-authored-by: Jake Hemmerle <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Qi Bin <[email protected]>
Signed-off-by: Flakes342 <[email protected]>
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
This PR includes these two fixes:
- Use GradScaler only for FP16 (not for BF16)
- Fix dtype conversion for ZeRO3 allgather
  - The reduce hook should be called only once, even when a parameter is shared across multiple layers (tied parameters).
  - Currently, the hook is triggered at each tied layer because we temporarily set `.data` with a different dtype.
  - The fix ensures that the parameter consistently retains the same dtype.

---------

Signed-off-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: jakehemmerle <[email protected]>
Signed-off-by: Qi Bin <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Stas Bekman <[email protected]>
Co-authored-by: digger yu <[email protected]>
Co-authored-by: Jake Hemmerle <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Qi Bin <[email protected]>