@Rakshit-gen Rakshit-gen commented Dec 20, 2025

Summary

  • Fix AttributeError: 'str' object has no attribute 'tag' when using Nebula checkpoint engine
  • Pass CheckpointCommitInfo object instead of raw tag string to checkpoint_engine.commit()

Description

The CheckpointEngine.commit() interface expects a CheckpointCommitInfo object, but two call sites in engine.py were passing a raw tag string instead:

  1. save_checkpoint() at line 3695
  2. save_16bit_model() at line 4230

This worked with TorchCheckpointEngine because it ignores the parameter, but NebulaCheckpointEngine accesses info.tag, causing the crash.
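The mismatch can be reproduced in isolation. The sketch below does not import DeepSpeed; `CommitInfoSketch`, `IgnoringEngine`, `TagReadingEngine`, and `try_commit` are hypothetical stand-ins that only model the behavior described above (one engine ignores its argument, the other reads `.tag`).

```python
from dataclasses import dataclass


@dataclass
class CommitInfoSketch:
    """Hypothetical stand-in for CheckpointCommitInfo; only `tag` is modeled."""
    tag: str


class IgnoringEngine:
    """Stand-in for an engine whose commit() never touches its argument."""

    def commit(self, info):
        return True


class TagReadingEngine:
    """Stand-in for an engine whose commit() reads info.tag."""

    def commit(self, info):
        return f"committed {info.tag}"


def try_commit(engine, arg):
    """Call commit() and return its result, or the AttributeError message."""
    try:
        return engine.commit(arg)
    except AttributeError as exc:
        return str(exc)


# A raw string goes unnoticed by the engine that ignores its argument...
print(try_commit(IgnoringEngine(), "global_step10"))
# ...but fails as soon as an engine dereferences .tag on it.
print(try_commit(TagReadingEngine(), "global_step10"))
# Passing an object that carries the tag works with both engines.
print(try_commit(TagReadingEngine(), CommitInfoSketch(tag="global_step10")))
```

This is why the bug stayed hidden: the call sites were wrong for every engine, but only an engine that reads the argument exposes it.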

Changes

  • Line 3695: Create CheckpointCommitInfo object before calling commit()
  • Line 4231: Use existing commit_info variable instead of tag
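The shape of the call-site change can be sketched as follows. This is an illustration under assumptions, not the actual diff: `CommitInfoSketch`, `RecordingEngine`, `save_before_fix`, and `save_after_fix` are hypothetical names, and the real `CheckpointCommitInfo` may carry more fields than the single `tag` modeled here.

```python
from dataclasses import dataclass


@dataclass
class CommitInfoSketch:
    """Hypothetical stand-in for CheckpointCommitInfo; only `tag` is modeled."""
    tag: str


class RecordingEngine:
    """Hypothetical engine that records whatever commit() receives."""

    def __init__(self):
        self.received = None

    def commit(self, info):
        self.received = info
        return True


def save_before_fix(engine, tag):
    # Old pattern: the raw tag string is handed straight to commit().
    return engine.commit(tag)


def save_after_fix(engine, tag):
    # New pattern: wrap the tag in a commit-info object first.
    commit_info = CommitInfoSketch(tag=tag)
    return engine.commit(commit_info)


engine = RecordingEngine()
save_before_fix(engine, "global_step10")
print(type(engine.received).__name__)  # the engine saw a plain str

save_after_fix(engine, "global_step10")
print(engine.received.tag)  # the engine can now read .tag
```

A recording stub like this could also back a regression check that asserts the engine receives an object exposing `.tag`, independent of which engine is configured.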

Test plan

  • Verify that saves with the Nebula checkpoint engine complete without AttributeError
  • Verify TorchCheckpointEngine still works (no regression)
  • Run existing checkpoint-related unit tests

Fixes #7678

Update checkpoint_engine.commit() calls to pass CheckpointCommitInfo
object instead of just the tag string, allowing more information to be
passed to the checkpoint engine.

Signed-off-by: Rakshit-gen <[email protected]>
@Rakshit-gen
Contributor Author

@sfc-gh-truwase can we review this change?

@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) December 22, 2025 12:46
@sfc-gh-truwase sfc-gh-truwase merged commit 6d9c3dc into deepspeedai:master Dec 22, 2025
10 of 11 checks passed

Development

Successfully merging this pull request may close these issues.

[BUG] nebula checkpoint engine AttributeError: 'str' object has no attribute 'tag'
