Enable non-ZeRO mode #7515
Conversation
@tohtana I have extended your
@stas00 FYI.
@sfc-gh-truwase Can you check why the cpu-torch unit tests are failing? Also, can we say that for non-ZeRO training we fully rely on
The existing check is:

```python
elif model_dtype == grad_accum_dtype:
    if model_dtype == torch.bfloat16:
        if self.pipeline_parallelism:
            logger.warning(...)
            return BFLOAT16
        else:
            raise NotImplementedError(...)
```

It only allows BF16 accumulation when PP is enabled.
@PKUWZP, great question. No, non-ZeRO can also use native mixed-precision training. Below is how this PR expands the mixed precision training options of DeepSpeed:
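The comparison that followed in the original comment is not captured here. As a rough, hypothetical illustration (not from the PR), a config along these lines would request non-ZeRO training (`stage=0`, i.e. DDP) together with native `torch.autocast` mixed precision; the `torch_autocast` section name and its fields are assumptions:

```python
# Hypothetical sketch, not from the PR: a DeepSpeed config selecting
# non-ZeRO training (ZeRO stage 0, i.e. plain DDP) with native
# torch.autocast mixed precision. The "torch_autocast" section and its
# fields are assumed here for illustration.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 0,  # non-ZeRO: plain data parallelism (DDP)
    },
    "torch_autocast": {
        "enabled": True,      # assumed: turn on native autocast
        "dtype": "bfloat16",  # assumed: autocast compute dtype
    },
}
```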
stas00 left a comment:
Thank you for making `stage=0` work, Tunji.
The PR is looking good; I added a few minor suggestions.
Enabled via `stage=0`, which corresponds to DDP.

- Remove hardwired path to bf16_optimizer.
- Enable `torch.autocast` for DDP training.
- Enable native mixed precision DDP for bfloat16.
- Update `torch.autocast` and native mixed precision UTs.

![image](https://github.com/user-attachments/assets/92904cdc-e312-46a4-943f-011eb5ab146a)
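As a hedged usage sketch (not part of this PR), this is roughly how a `stage=0` configuration might be exercised end to end. The `torch_autocast` section is again an assumption, and the script is intended to run under the `deepspeed` launcher:

```python
# Hypothetical end-to-end sketch, assuming stage=0 (non-ZeRO / DDP) and an
# assumed "torch_autocast" config section; intended to run under the
# `deepspeed` launcher so distributed initialization is handled for us.
import torch
import deepspeed

model = torch.nn.Linear(16, 4)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={
        "train_batch_size": 8,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
        "zero_optimization": {"stage": 0},  # non-ZeRO: plain data parallelism
        "torch_autocast": {"enabled": True, "dtype": "bfloat16"},  # assumed section
    },
)

inputs = torch.randn(8, 16, device=model_engine.device)
targets = torch.randn(8, 4, device=model_engine.device)

loss = torch.nn.functional.mse_loss(model_engine(inputs), targets)
model_engine.backward(loss)  # engine drives the mixed-precision backward
model_engine.step()
```

The point is only that with `stage=0` the engine runs plain data parallelism while `model_engine.backward()` and `step()` still drive the mixed-precision path.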