Fix OnebitLamb NaN propagation with empty parameters #7736
Conversation
@Rakshit-gen thanks for the PR. By empty parameters, do you mean the model has no parameters? If so, can you give more details of how this can happen, as it seems quite uncommon to me?

@sfc-gh-truwase This prevents division by zero when exp_avg has 0 elements. Empty parameter groups or filtered-out parameters can leave empty optimizer state tensors, causing crashes during the momentum scale computation. The check avoids this edge case.
Fixed a critical issue where the OnebitLamb optimizer would produce NaNs when optimizing models with empty parameters (numel=0). The scaling factor calculation involved division by sqrt(numel), which resulted in 0.0/0.0 -> NaN for empty parameters. This NaN value propagated to the global scaling coefficient, corrupting the state of all other parameters. Changed the denominator to use max(numel, 1) or a conditional 1.0 to ensure safe division.

Signed-off-by: Rakshit-gen <[email protected]>
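A minimal sketch of the guarded division this commit describes, assuming `exp_avg` is the optimizer's first-moment state tensor; the variable names are illustrative, not the actual OnebitLamb source:

```python
import torch

def momentum_scale(exp_avg: torch.Tensor) -> float:
    # For an empty tensor, torch.norm(...) is 0.0 and sqrt(numel) is 0.0,
    # so the unguarded version computes 0.0 / 0.0 == NaN.
    # Guarding the denominator with max(numel, 1) yields a harmless 0.0 instead.
    numel = torch.numel(exp_avg)
    return (torch.norm(exp_avg) / max(numel, 1) ** 0.5).item()
```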
Force-pushed from 93ec43e to eed7bd6
Thanks for the explanation. Unfortunately, that does not address my question of in what scenarios a model would have all its parameters filtered out. Or more precisely, why would a parameter group be empty? If parameters don't exist then optimizer state is not needed, and I would expect this code path not to be executed at all. So I feel such situations ought to be handled at a higher logical level. Alternatively, can you create a small example demonstrating this problem? That would be something to add to our unit tests as well. Thanks.

When torch.numel(exp_avg) == 0 (empty parameter), the original code does:
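(The code block from this comment did not survive extraction; the following is a reconstruction of the failure mode described in the commit message, not the verbatim OnebitLamb source.)

```python
import math
import torch

exp_avg = torch.empty(0)  # optimizer state for a parameter with numel() == 0

# The norm of an empty tensor is 0.0 and sqrt(0) is 0.0, so this is 0.0 / 0.0
scale = torch.norm(exp_avg) / math.sqrt(torch.numel(exp_avg))
print(scale)  # tensor(nan) -- this NaN then taints the global scaling coefficient
```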
Thanks for sharing the example code, it is very helpful. I believe the correct solution is to avoid passing empty parameters to the optimizer since there is nothing to optimize. As a reference, this is how frozen parameters (i.e., requires_grad=False) are handled.
Filter out parameters with numel==0 in OnebitLamb.__init__() before passing them to the optimizer, similar to how frozen parameters are handled. This prevents NaN propagation that occurred when computing scaling coefficients for empty parameters.

- Filter empty parameters in __init__ before creating optimizer groups
- Revert workaround fix in step() method since empty params won't reach it
- Add unit test to verify empty parameters are filtered and training proceeds without NaN

Signed-off-by: Rakshit-gen <[email protected]>
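A hedged sketch of the filtering pattern this commit describes; the real OnebitLamb.__init__() handles more group options, and the helper name here is illustrative:

```python
def _filter_empty_params(params):
    """Drop numel() == 0 parameters before building optimizer groups,
    mirroring how frozen (requires_grad=False) parameters are skipped."""
    filtered = []
    for entry in params:
        if isinstance(entry, dict):  # a parameter group with options
            entry = dict(entry)
            entry['params'] = [p for p in entry['params'] if p.numel() > 0]
            filtered.append(entry)
        elif entry.numel() > 0:  # a bare tensor/parameter
            filtered.append(entry)
    return filtered
```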
@sfc-gh-truwase Agreed. Updated the code to filter out empty parameters (numel() == 0) in OnebitLamb.__init__() before creating optimizer groups, following the same pattern as frozen parameters. Reverted the workaround and added a unit test. Changes are in the latest commit.
@Rakshit-gen thanks, I will review asap. By the way, are you able to share how you are using OneBitLamb? I am unaware of many users. Thanks.
I used OneBitLamb when training a BERT model across multiple machines. Gradient synchronization between machines was a bottleneck that slowed training. OneBitLamb compresses the data sent between machines, so communication is faster.
@sfc-gh-truwase Can we merge this now? The PR passed all the tests. Are there any blockers?
@Rakshit-gen the only blocker is my review, and I will do it asap. Sorry for the delay.
No issues, @sfc-gh-truwase. Do tell me if there is any other issue I can help with!
@sfc-gh-truwase any updates on this?
…speed/runtime/utils.py. This utility can now be reused by other optimizers that need to filter empty parameters. Signed-off-by: Rakshit-gen <[email protected]>
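A sketch of what such a shared helper might look like; the function name and signature are illustrative, not the merged DeepSpeed API:

```python
import torch

def filter_empty_parameters(parameters):
    """Return only parameters with at least one element, so optimizers
    never build state tensors for numel() == 0 parameters."""
    return [p for p in parameters if torch.numel(p) > 0]
```

Any optimizer could then apply this in its constructor before forwarding the parameters to its base class.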
Thanks @sfc-gh-truwase
@Rakshit-gen thanks for the contribution and willingness to do more. We can definitely use your help. However, things are slowing down now for the Holidays, so I will reach out again in the New Year. |
|
@sfc-gh-truwase Thanks for the update — appreciate it. Totally understand the holiday slowdown. Looking forward to reconnecting in the New Year. Enjoy the holidays. |