
Conversation

@therealnaveenkamal
Contributor

Fixes #7571

When ZeRO is disabled (stage 0) and bf16 is enabled, the current guard sets `load_zero_checkpoint=True`, which leads to `_load_zero_checkpoint` and `_restore_from_bit16_weights()` being called even though no ZeRO state exists.

This PR removes the `self.bfloat16_enabled()` condition so that `load_zero_checkpoint` is tied strictly to `self.zero_optimization()` (see the sketch after this list):

- Stage 0 (BF16/FP16/FP32): cleanly skips the ZeRO checkpoint path.

- Stage ≥ 1: loads ZeRO-partitioned optimizer state as before.
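
A minimal sketch of the behavioral change, assuming the predicate described above; `should_load_zero_checkpoint` is a hypothetical helper for illustration, not a DeepSpeed API:

```python
def should_load_zero_checkpoint(zero_stage: int, bf16_enabled: bool) -> bool:
    # After the fix: the ZeRO checkpoint path depends only on ZeRO being
    # enabled (stage >= 1). Before the fix the predicate was effectively
    # `zero_stage > 0 or bf16_enabled`, so stage 0 + bf16 wrongly took the
    # ZeRO path and hit _restore_from_bit16_weights() on an optimizer
    # that has no such method. bf16_enabled is intentionally unused now.
    return zero_stage > 0


# Stage 0 with bf16: now cleanly skips the ZeRO checkpoint path.
assert not should_load_zero_checkpoint(zero_stage=0, bf16_enabled=True)
# Stage >= 1: still loads ZeRO-partitioned optimizer state.
assert should_load_zero_checkpoint(zero_stage=2, bf16_enabled=True)
```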

cc @sfc-gh-truwase

@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) September 25, 2025 20:09
@sfc-gh-truwase sfc-gh-truwase merged commit b756540 into deepspeedai:master Sep 25, 2025
12 checks passed
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025

Signed-off-by: Naveenraj Kamalakannan <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>


Development

Successfully merging this pull request may close these issues.

AttributeError: 'FP16_Optimizer' object has no attribute '_restore_from_bit16_weights'
