Enable non-ZeRO mode #7515
Conversation
@tohtana I have extended your
@stas00 FYI.
@sfc-gh-truwase Can you check why the cpu-torch unit tests are failing? Also, can we say that for non-ZeRO training we fully rely on
The existing check is:

```python
elif model_dtype == grad_accum_dtype:
    if model_dtype == torch.bfloat16:
        if self.pipeline_parallelism:
            logger.warning(...)
            return BFLOAT16
        else:
            raise NotImplementedError(...)
```

It only allows BF16 accumulation when PP is enabled.
@PKUWZP, great question. No, non-ZeRO can also use native mixed-precision training. Below is how this PR expands the mixed precision training options of DeepSpeed:
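The comparison that followed in the original comment is not captured here. As a rough, hypothetical illustration (not from the PR), a config along these lines would request non-ZeRO training (`stage=0`, i.e. DDP) together with native `torch.autocast` mixed precision; the `torch_autocast` section name and its fields are assumptions:

```python
# Hypothetical sketch, not from the PR: a DeepSpeed config selecting
# non-ZeRO training (ZeRO stage 0, i.e. plain DDP) with native
# torch.autocast mixed precision. The "torch_autocast" section and its
# fields are assumed here for illustration.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 0,  # non-ZeRO: plain data parallelism (DDP)
    },
    "torch_autocast": {
        "enabled": True,      # assumed: turn on native autocast
        "dtype": "bfloat16",  # assumed: autocast compute dtype
    },
}
```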
stas00 left a comment:
Thank you for making `stage=0` work, Tunji.
The PR is looking good; I added a few minor suggestions.
Enabled via `stage=0`, which corresponds to DDP.

- Remove hardwired path to bf16_optimizer.
- Enable `torch.autocast` for DDP training.
- Enable native mixed precision DDP for bfloat16.
- Update `torch.autocast` and native mixed precision UTs.

![image](https://github.com/user-attachments/assets/92904cdc-e312-46a4-943f-011eb5ab146a)
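As a hedged usage sketch (not part of this PR), this is roughly how a `stage=0` configuration might be exercised end to end. The `torch_autocast` section is again an assumption, and the script is intended to run under the `deepspeed` launcher:

```python
# Hypothetical end-to-end sketch, assuming stage=0 (non-ZeRO / DDP) and an
# assumed "torch_autocast" config section; intended to run under the
# `deepspeed` launcher so distributed initialization is handled for us.
import torch
import deepspeed

model = torch.nn.Linear(16, 4)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={
        "train_batch_size": 8,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
        "zero_optimization": {"stage": 0},  # non-ZeRO: plain data parallelism
        "torch_autocast": {"enabled": True, "dtype": "bfloat16"},  # assumed section
    },
)

inputs = torch.randn(8, 16, device=model_engine.device)
targets = torch.randn(8, 4, device=model_engine.device)

loss = torch.nn.functional.mse_loss(model_engine(inputs), targets)
model_engine.backward(loss)  # engine drives the mixed-precision backward
model_engine.step()
```

The point is only that with `stage=0` the engine runs plain data parallelism while `model_engine.backward()` and `step()` still drive the mixed-precision path.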