
Conversation

@eternalNight (Contributor) commented:

When torch autocast is enabled, model weights are already in fp32 and can be directly updated by the optimizer with fp32 gradients. It is a waste of accelerator memory to keep another copy, also in fp32, as the master weight.

Use aliases to the so-called "fp16" params as the master weights to save memory. This applies only when no optimizer offloading (either CPU or NVMe) or swapping mechanism is enabled.
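
Conceptually, the change amounts to aliasing instead of cloning when the master weights are materialized. A minimal sketch of the idea, with hypothetical helper and parameter names rather than the actual DeepSpeed code paths:

```python
import torch

def build_master_weights(params, offload_optimizer=False, swap_optimizer=False):
    # Illustrative sketch only; names and structure are not DeepSpeed's.
    master = []
    for p in params:
        if p.dtype == torch.float32 and not (offload_optimizer or swap_optimizer):
            # Under torch autocast the "fp16" param group already holds fp32
            # tensors, so the master weight can simply alias the model weight
            # and the optimizer updates it in place.
            master.append(p)
        else:
            # Usual mixed-precision path: keep a detached fp32 copy.
            master.append(p.detach().clone().float())
    return master
```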

Using https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3 (which enables torch autocast) as an example, the memory profile of the training startup phase is as follows:

[Picture1: memory profile of the training startup phase before this PR, with the fp32 master-weight copy instantiated]

With this PR, the master weights are no longer instantiated:

[Picture2: memory profile with this PR; no separate master-weight copy]

This is also true when DeepCompile is enabled:

[Picture3: memory profile with this PR and DeepCompile enabled]

When torch autocast is disabled, the master weights are preserved:

[Picture4: memory profile with torch autocast disabled; master weights preserved]
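
For reference, a minimal sketch of the relevant config toggle, assuming the `torch_autocast` section described in the DeepSpeed documentation; the exact field names may differ between versions, and the other fields are placeholders, not taken from the gist linked above:

```python
# Illustrative DeepSpeed config fragment (assumed schema, not the gist's config).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
    "torch_autocast": {
        "enabled": True,   # set to False to keep the fp32 master-weight copy
        "dtype": "bfloat16",
    },
}
```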


Signed-off-by: Junjie Mao <[email protected]>
@tohtana (Collaborator) left a comment:


Thank you @eternalNight! I actually encountered this issue and was wondering how to fix it.
This is definitely a significant improvement.

@tohtana merged commit 706f6e8 into deepspeedai:master on Oct 27, 2025
15 of 17 checks passed
