[BUG] `bucket.elements` is not correctly cleared in ZeRO Stage 3 #7415

### Description:
I observed that `bucket.elements` is not correctly cleared in the function `__reduce_and_partition_ipg_grads` when using ZeRO Stage 3.
To confirm this, I printed the contents of `bucket.elements` along with the number of parameters in the bucket every time `__reduce_and_partition_ipg_grads` is called. The output is shown below:

<img width="890" height="765" alt="Image" src="https://github.com/user-attachments/assets/0a18b88a-6ce7-401c-be81-fbb49b26f37a" />

The output shows that after the first reduction, `bucket.elements` is not properly reset. As a result, only a single parameter is reduced in subsequent steps.

To fix the issue, I added an explicit line to reset `bucket.elements` to `0`:

```
params_in_bucket.clear()
bucket.elements = 0
```

After applying this change, the reduction behavior returned to normal, as shown below:

<img width="909" height="81" alt="Image" src="https://github.com/user-attachments/assets/008b6a41-76c2-4501-ac5c-95a6fd94988e" />

This suggests that `params_in_bucket.clear()` alone does not fully reset the bucket state: the parameter list is emptied, but `bucket.elements` keeps its previous value.
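
To illustrate the symptom outside of DeepSpeed, here is a minimal toy model. It is only a sketch: `ToyBucket`, `simulate`, and the overflow check are hypothetical and are not DeepSpeed's actual implementation. It assumes a bucket is flushed whenever adding the next parameter would exceed the bucket size, and it shows how a stale element counter forces a flush on every later parameter.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBucket:
    # Hypothetical stand-in for a gradient bucket; not DeepSpeed's class.
    params: list = field(default_factory=list)
    elements: int = 0


def simulate(sizes, bucket_size, reset_elements):
    """Return the number of params reduced in each flush."""
    bucket = ToyBucket()
    batches = []

    def flush():
        batches.append(len(bucket.params))
        bucket.params.clear()
        if reset_elements:
            # The proposed fix: reset the counter together with the list.
            bucket.elements = 0

    for numel in sizes:
        # Assumed overflow rule: flush before the bucket would exceed its size.
        if bucket.params and bucket.elements + numel > bucket_size:
            flush()
        bucket.params.append(numel)
        bucket.elements += numel

    if bucket.params:
        flush()
    return batches


if __name__ == "__main__":
    print(simulate([4] * 6, bucket_size=10, reset_elements=True))
    print(simulate([4] * 6, bucket_size=10, reset_elements=False))
```

In this toy model, leaving the counter untouched means that every parameter after the first flush is reduced on its own, which matches the behavior I see in the first screenshot.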

### DeepSpeed Config:
```
# `args` and `total_steps` come from the training script (not shown).
ds_config = {
    "train_batch_size": args.batch_size,
    "bf16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": 5e8,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
    },
    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": total_steps,
            "warmup_num_steps": int(args.warmup * total_steps)
        }
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": args.lr,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": args.weight_decay
        }
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "logging": {
        "level": "info"
    },
}
```

### Version Info:
DeepSpeed: `0.17.2+da60a878`
