
Conversation

@Rakshit-gen
Contributor

Fixed a critical issue where OnebitLamb optimizer would produce NaNs when optimizing models with empty parameters (numel=0).

The scaling factor calculation involved division by sqrt(numel), which resulted in 0.0/0.0 -> NaN for empty parameters. This NaN value propagated to the global scaling coefficient, corrupting the state of all other parameters.

Changed the denominator to use max(numel, 1) or conditional 1.0 to ensure safe division.
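
For reference, a minimal standalone sketch of the failure mode and the guarded division (illustration only, not the actual lamb.py code):

# Minimal sketch of the bug and the guard (illustrative, not the actual lamb.py code)
import torch
import numpy as np

exp_avg = torch.zeros(0)   # optimizer state tensor for an empty parameter (numel == 0)
numel = exp_avg.numel()

unsafe = torch.linalg.vector_norm(exp_avg) / np.sqrt(numel)        # 0.0 / 0.0 -> nan
safe = torch.linalg.vector_norm(exp_avg) / np.sqrt(max(numel, 1))  # 0.0 / 1.0 -> 0.0

print(unsafe, safe)  # tensor(nan) tensor(0.)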

@sfc-gh-truwase
Collaborator

@Rakshit-gen thanks for the PR.

By empty parameters, do you mean the model has no parameters? If so, can you give more details on how this can happen, as it seems quite uncommon to me?

@Rakshit-gen
Contributor Author

@sfc-gh-truwase Prevents division by zero when exp_avg has 0 elements. Empty parameter groups or filtered-out parameters can leave empty optimizer state tensors, causing crashes during momentum scale computation. The check avoids this edge case.

Fixed a critical issue where OnebitLamb optimizer would produce NaNs
when optimizing models with empty parameters (numel=0).

The scaling factor calculation involved division by sqrt(numel), which
resulted in 0.0/0.0 -> NaN for empty parameters. This NaN value propagated
to the global scaling coefficient, corrupting the state of all other parameters.

Changed the denominator to use max(numel, 1) or conditional 1.0 to ensure
safe division.

Signed-off-by: Rakshit-gen <[email protected]>
@sfc-gh-truwase
Collaborator

Thanks for the explanation. Unfortunately, that does not address my question of in what scenarios a model would have all its parameters filtered out. Or more precisely, why would a parameter group be empty? If parameters don't exist, then optimizer state is not needed, and I would expect this code path not to be executed at all. So I feel such situations ought to be handled at a higher logical level.

Alternatively, can you create a small example demonstrating this problem? That would be something to add to our unit tests as well. Thanks.

@Rakshit-gen
Contributor Author

@sfc-gh-truwase

import torch
import torch.nn as nn
import numpy as np

# Create a model with a normal parameter and an empty parameter
class ModelWithEmptyParam(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)  # Normal parameter
        # Empty parameter (0 elements) - can happen in conditional architectures
        self.empty_param = nn.Parameter(torch.empty(0, 10))
    
    def forward(self, x):
        return self.linear(x)

model = ModelWithEmptyParam()

# Create parameter groups - one with normal params, one with empty param
param_groups = [
    {'params': [model.linear.weight, model.linear.bias], 'weight_decay': 0.01},
    {'params': [model.empty_param], 'weight_decay': 0.0}  # Empty parameter group
]

# Simulate the problematic calculation from OnebitLamb
# This is what happens in lines 180-181 of lamb.py
for group in param_groups:
    for p in group['params']:
        exp_avg = torch.zeros_like(p.data)  # Simulated momentum
        
        # BEFORE FIX (causes NaN):
        # numel = torch.numel(exp_avg)  # = 0 for empty param
        # scale = torch.linalg.vector_norm(exp_avg) / np.sqrt(numel)
        # Result: 0.0 / sqrt(0) = 0.0 / 0.0 = NaN ❌
        
        # AFTER FIX (safe):
        numel = torch.numel(exp_avg)
        scale = torch.linalg.vector_norm(exp_avg) / np.sqrt(numel if numel > 0 else 1.0)
        # Result: 0.0 / sqrt(1.0) = 0.0 ✅
        
        print(f"Param shape: {p.shape}, numel: {numel}, scale: {scale}")
        assert not torch.isnan(scale), "Should not produce NaN!"

When torch.numel(exp_avg) == 0 (an empty parameter), the original code does:

- vector_norm(exp_avg) = 0.0 (the norm of an empty tensor)
- np.sqrt(0) = 0.0
- 0.0 / 0.0 = NaN

The fix uses max(numel, 1) (or numel if numel > 0 else 1.0), so the division becomes 0.0 / sqrt(1.0) = 0.0 and no NaN is produced. Without the fix, the NaN propagates to united_scale and corrupts the scaling_coeff values of all other parameters.
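
To make the propagation concrete, here is a hypothetical illustration (the exact united_scale computation in lamb.py may differ); it only shows why a single NaN scale contaminates every group's coefficient:

import torch

# Hypothetical illustration: one NaN scale poisons any aggregate built from it.
scales = [torch.tensor(0.8), torch.tensor(float('nan'))]   # the NaN came from the empty parameter
united_scale = torch.stack(scales).mean()                  # tensor(nan)
scaling_coeffs = [united_scale / s for s in scales]        # every coefficient is now nan
print(united_scale, scaling_coeffs)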

@sfc-gh-truwase
Collaborator

Thanks for sharing the example code, it is very helpful. I believe the correct solution is to avoid passing empty parameters to the optimizer, since there is nothing to optimize. As a reference, this is how frozen parameters (i.e., .requires_grad == False) are handled.

Filter out parameters with numel==0 in OnebitLamb.__init__() before
passing them to the optimizer, similar to how frozen parameters are
handled. This prevents NaN propagation that occurred when computing
scaling coefficients for empty parameters.

- Filter empty parameters in __init__ before creating optimizer groups
- Revert workaround fix in step() method since empty params won't reach it
- Add unit test to verify empty parameters are filtered and training proceeds without NaN

Signed-off-by: Rakshit-gen <[email protected]>
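
A minimal sketch of the filtering pattern this commit describes (illustrative only; the helper name and its placement are assumptions, not the actual DeepSpeed code):

def drop_empty_params(params):
    """Remove parameters with numel() == 0 so no optimizer state is created for them.

    Assumes `params` is either a list of tensors or a list of param-group dicts,
    mirroring what torch optimizers accept.
    """
    if len(params) > 0 and isinstance(params[0], dict):
        # list of param-group dicts: rebuild each group without its empty parameters
        return [
            {**group, 'params': [p for p in group['params'] if p.numel() > 0]}
            for group in params
        ]
    # flat list of tensors
    return [p for p in params if p.numel() > 0]

In __init__ this would run before the parameter groups are handed to the base optimizer, mirroring how frozen (requires_grad == False) parameters are skipped.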
@Rakshit-gen
Contributor Author

@sfc-gh-truwase Agreed. Updated the code to filter out empty parameters (numel() == 0) in OnebitLamb.__init__() before creating the optimizer groups, following the same pattern as frozen parameters. Reverted the workaround and added a unit test. The changes are in the latest commit.

@sfc-gh-truwase
Collaborator

@Rakshit-gen thanks, I will review asap.

By the way, are you able to share how you are using OneBitLamb? I am unaware of many users. Thanks.

@Rakshit-gen
Contributor Author

I used OneBitLamb when training a BERT model across multiple machines. Gradient synchronization between machines was a bottleneck that slowed training. OneBitLamb compresses the data sent between machines, so communication is faster.
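
Roughly, the DeepSpeed config side of it looks like this, sketched as a Python dict (field names follow the 1-bit LAMB tutorial; values here are placeholders, so check them against your DeepSpeed version):

ds_config = {
    "train_batch_size": 32,
    "optimizer": {
        "type": "OneBitLamb",
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,          # full-precision warmup steps before compression starts
            "cuda_aware": False,
            "comm_backend_name": "nccl"
        }
    }
}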

@Rakshit-gen
Contributor Author

@sfc-gh-truwase can we merge this now? The PR passed all the tests. Are there any blockers?

@sfc-gh-truwase
Collaborator

@Rakshit-gen the only blocker is my review, which I will do asap. Sorry for the delay.

@Rakshit-gen
Contributor Author

No issues, @sfc-gh-truwase. Do tell me if there is any other issue I can help with!

@Rakshit-gen
Contributor Author

@sfc-gh-truwase any updates on this?

…speed/runtime/utils.py. This utility can now be reused by other optimizers that need to filter empty parameters.

Signed-off-by: Rakshit-gen <[email protected]>
sfc-gh-truwase enabled auto-merge (squash) December 24, 2025 18:26
sfc-gh-truwase merged commit b4e74a9 into deepspeedai:master Dec 24, 2025
11 checks passed
@Rakshit-gen
Contributor Author

Thanks @sfc-gh-truwase
Do inform me if I can work on any other issues!

@sfc-gh-truwase
Collaborator

@Rakshit-gen thanks for the contribution and willingness to do more. We can definitely use your help. However, things are slowing down now for the Holidays, so I will reach out again in the New Year.

@Rakshit-gen
Contributor Author

@sfc-gh-truwase Thanks for the update — appreciate it. Totally understand the holiday slowdown. Looking forward to reconnecting in the New Year. Enjoy the holidays.
