
Conversation

@Rakshit-gen
Contributor

Fixed a critical issue where OnebitLamb optimizer would produce NaNs when optimizing models with empty parameters (numel=0).

The scaling factor calculation involved division by sqrt(numel), which resulted in 0.0/0.0 -> NaN for empty parameters. This NaN value propagated to the global scaling coefficient, corrupting the state of all other parameters.

Changed the denominator to use max(numel, 1) or conditional 1.0 to ensure safe division.
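
For reference, a minimal standalone sketch of the failure mode and the guarded division (illustration only, not the actual lamb.py code):

# Minimal sketch of the bug and the guard (illustrative, not the actual lamb.py code)
import torch
import numpy as np

exp_avg = torch.zeros(0)   # optimizer state tensor for an empty parameter (numel == 0)
numel = exp_avg.numel()

unsafe = torch.linalg.vector_norm(exp_avg) / np.sqrt(numel)        # 0.0 / 0.0 -> nan
safe = torch.linalg.vector_norm(exp_avg) / np.sqrt(max(numel, 1))  # 0.0 / 1.0 -> 0.0

print(unsafe, safe)  # tensor(nan) tensor(0.)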

@sfc-gh-truwase
Collaborator

@Rakshit-gen thanks for the PR.

By empty parameters, do you mean the model has no parameters? If so, can you give more details on how this can happen, as it seems quite uncommon to me?

@Rakshit-gen
Contributor Author

@sfc-gh-truwase Prevents division by zero when exp_avg has 0 elements. Empty parameter groups or filtered-out parameters can leave empty optimizer state tensors, causing crashes during momentum scale computation. The check avoids this edge case.

Fixed a critical issue where OnebitLamb optimizer would produce NaNs
when optimizing models with empty parameters (numel=0).

The scaling factor calculation involved division by sqrt(numel), which
resulted in 0.0/0.0 -> NaN for empty parameters. This NaN value propagated
to the global scaling coefficient, corrupting the state of all other parameters.

Changed the denominator to use max(numel, 1) or conditional 1.0 to ensure
safe division.

Signed-off-by: Rakshit-gen <[email protected]>
@sfc-gh-truwase
Collaborator

Thanks for the explanation. Unfortunately, that does not address my question of in what scenarios a model would have all its parameters filtered out. Or more precisely, why would a parameter group be empty? If parameters don't exist, then optimizer state is not needed, and I would expect this code path not to be executed at all. So I feel such situations ought to be handled at a higher logical level.

Alternatively, can you create a small example demonstrating this problem? That would be something to add to our unit tests as well. Thanks.

@Rakshit-gen
Contributor Author

@sfc-gh-truwase

import torch
import torch.nn as nn
import numpy as np

# Create a model with a normal parameter and an empty parameter
class ModelWithEmptyParam(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)  # Normal parameter
        # Empty parameter (0 elements) - can happen in conditional architectures
        self.empty_param = nn.Parameter(torch.empty(0, 10))
    
    def forward(self, x):
        return self.linear(x)

model = ModelWithEmptyParam()

# Create parameter groups - one with normal params, one with empty param
param_groups = [
    {'params': [model.linear.weight, model.linear.bias], 'weight_decay': 0.01},
    {'params': [model.empty_param], 'weight_decay': 0.0}  # Empty parameter group
]

# Simulate the problematic calculation from OnebitLamb
# This is what happens in lines 180-181 of lamb.py
for group in param_groups:
    for p in group['params']:
        exp_avg = torch.zeros_like(p.data)  # Simulated momentum
        
        # BEFORE FIX (causes NaN):
        # numel = torch.numel(exp_avg)  # = 0 for empty param
        # scale = torch.linalg.vector_norm(exp_avg) / np.sqrt(numel)
        # Result: 0.0 / sqrt(0) = 0.0 / 0.0 = NaN ❌
        
        # AFTER FIX (safe):
        numel = torch.numel(exp_avg)
        scale = torch.linalg.vector_norm(exp_avg) / np.sqrt(numel if numel > 0 else 1.0)
        # Result: 0.0 / sqrt(1.0) = 0.0 ✅
        
        print(f"Param shape: {p.shape}, numel: {numel}, scale: {scale}")
        assert not torch.isnan(scale), "Should not produce NaN!"

When torch.numel(exp_avg) == 0 (an empty parameter), the original code does:

- vector_norm(exp_avg) = 0.0 (the norm of an empty tensor)
- np.sqrt(0) = 0.0
- 0.0 / 0.0 = NaN

The fix uses max(numel, 1) (or numel if numel > 0 else 1.0), so the division becomes 0.0 / sqrt(1.0) = 0.0 and no NaN is produced. Without the fix, the NaN propagates to united_scale and corrupts the scaling_coeff values of all other parameters.
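
To make the propagation concrete, here is a hypothetical illustration (the exact united_scale computation in lamb.py may differ); it only shows why a single NaN scale contaminates every group's coefficient:

import torch

# Hypothetical illustration: one NaN scale poisons any aggregate built from it.
scales = [torch.tensor(0.8), torch.tensor(float('nan'))]   # the NaN came from the empty parameter
united_scale = torch.stack(scales).mean()                  # tensor(nan)
scaling_coeffs = [united_scale / s for s in scales]        # every coefficient is now nan
print(united_scale, scaling_coeffs)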

@sfc-gh-truwase
Collaborator

Thanks for sharing the example code, it is very helpful. I believe the correct solution is to avoid passing empty parameters to the optimizer, since there is nothing to optimize. As a reference, this is how frozen parameters (i.e., .requires_grad == False) are handled.

Filter out parameters with numel==0 in OnebitLamb.__init__() before
passing them to the optimizer, similar to how frozen parameters are
handled. This prevents NaN propagation that occurred when computing
scaling coefficients for empty parameters.

- Filter empty parameters in __init__ before creating optimizer groups
- Revert workaround fix in step() method since empty params won't reach it
- Add unit test to verify empty parameters are filtered and training proceeds without NaN

Signed-off-by: Rakshit-gen <[email protected]>
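
A minimal sketch of the filtering pattern this commit describes (illustrative only; the helper name and its placement are assumptions, not the actual DeepSpeed code):

def drop_empty_params(params):
    """Remove parameters with numel() == 0 so no optimizer state is created for them.

    Assumes `params` is either a list of tensors or a list of param-group dicts,
    mirroring what torch optimizers accept.
    """
    if len(params) > 0 and isinstance(params[0], dict):
        # list of param-group dicts: rebuild each group without its empty parameters
        return [
            {**group, 'params': [p for p in group['params'] if p.numel() > 0]}
            for group in params
        ]
    # flat list of tensors
    return [p for p in params if p.numel() > 0]

In __init__ this would run before the parameter groups are handed to the base optimizer, mirroring how frozen (requires_grad == False) parameters are skipped.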
@Rakshit-gen
Contributor Author

@sfc-gh-truwase Agreed. Updated the code to filter out empty parameters (numel() == 0) in OnebitLamb.__init__() before creating the optimizer groups, following the same pattern as frozen parameters. Reverted the workaround and added a unit test. The changes are in the latest commit.

@sfc-gh-truwase
Collaborator

@Rakshit-gen thanks, I will review asap.

By the way, are you able to share how you are using OneBitLamb? I am unaware of many users. Thanks.

@Rakshit-gen
Contributor Author

I used OneBitLamb when training a BERT model across multiple machines. Gradient synchronization between machines was a bottleneck that slowed training. OneBitLamb compresses the data sent between machines, so communication is faster.
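
Roughly, the DeepSpeed config side of it looks like this, sketched as a Python dict (field names follow the 1-bit LAMB tutorial; values here are placeholders, so check them against your DeepSpeed version):

ds_config = {
    "train_batch_size": 32,
    "optimizer": {
        "type": "OneBitLamb",
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,          # full-precision warmup steps before compression starts
            "cuda_aware": False,
            "comm_backend_name": "nccl"
        }
    }
}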

@Rakshit-gen
Contributor Author

@sfc-gh-truwase can we merge this now? The PR passed all the tests. Are there any blockers?

@sfc-gh-truwase
Collaborator

@Rakshit-gen the only blocker is my review, which I will do asap. Sorry for the delay.

@Rakshit-gen
Contributor Author

No issues, @sfc-gh-truwase. Do tell me if there is any other issue I can help with!

@Rakshit-gen
Contributor Author

@sfc-gh-truwase any updates on this?

…speed/runtime/utils.py. This utility can now be reused by other optimizers that need to filter empty parameters.

Signed-off-by: Rakshit-gen <[email protected]>
sfc-gh-truwase enabled auto-merge (squash) December 24, 2025 18:26
sfc-gh-truwase merged commit b4e74a9 into deepspeedai:master Dec 24, 2025
11 checks passed
@Rakshit-gen
Contributor Author

Thanks @sfc-gh-truwase
Do inform me if I can work on any other issues!

@sfc-gh-truwase
Collaborator

@Rakshit-gen thanks for the contribution and willingness to do more. We can definitely use your help. However, things are slowing down now for the Holidays, so I will reach out again in the New Year.

@Rakshit-gen
Contributor Author

@sfc-gh-truwase Thanks for the update — appreciate it. Totally understand the holiday slowdown. Looking forward to reconnecting in the New Year. Enjoy the holidays.
