Skip to content

Conversation

@Rakshit-gen
Copy link
Contributor

@Rakshit-gen Rakshit-gen commented Dec 20, 2025

Summary

  • Add timeouts to prevent indefinite hangs when checkpoint process crashes
  • Replace assertion with proper runtime validation
  • Add process health checks before blocking operations
  • Implement graceful cleanup with escalating termination

Fixes #7741

Description

The DecoupledCheckpointEngine had several critical issues that could cause training jobs to hang indefinitely:

  1. No timeout on save_event.wait(): If the checkpoint process died, training would hang forever waiting for an event that would never fire.

  2. No timeout on ckpt_process.join(): If the process crashed or hung, cleanup would block indefinitely.

  3. Assertion used for runtime validation: assert info == self.commit_info is disabled with python -O, allowing silent data corruption in production.

  4. __del__ could hang: The destructor called cleanup() which could block, causing program hang on exit.

Changes

  • Add _wait_for_event_with_timeout() that checks process health every 10 seconds while waiting
  • Add _check_process_alive() helper to validate process state before blocking operations
  • Replace assert with proper if/raise ValueError for commit info validation
  • Add timeout to join() with escalating termination: join()terminate()kill()
  • Wrap __del__ in try/except and add _cleanup_called flag to prevent multiple cleanup calls
  • Add health checks in save() and commit() before queue operations

Test plan

  • Verify normal checkpoint save/load still works
  • Test behavior when checkpoint process is killed mid-save
  • Verify timeout triggers after 5 minutes of no response
  • Confirm graceful shutdown with Ctrl+C during checkpoint
  • Test with python -O to verify validation still works

Add timeout and process health checks to prevent deadlocks in decoupled
checkpoint engine. This includes:
- Timeout for checkpoint commit operations (default 5 minutes)
- Process health checks while waiting for checkpoint completion
- Proper error handling in cleanup to prevent crashes
- Forceful termination if process doesn't exit gracefully

Signed-off-by: Rakshit-gen <[email protected]>
@sfc-gh-truwase
Copy link
Collaborator

@Rakshit-gen thanks for improving this feature. Are you planning to use DecoupledCheckpointEngine?

@sfc-gh-truwase
Copy link
Collaborator

Can you also clarify if this is ready for review? I ask because of the test plan checklist.

@Rakshit-gen
Copy link
Contributor Author

@sfc-gh-truwase Yes, ready for review. The checklist is for reviewer validation - I've tested basic functionality locally but would appreciate someone with a multi-GPU decoupled checkpointing setup to verify the edge cases.

@Rakshit-gen
Copy link
Contributor Author

No, I'm not currently using DecoupledCheckpointEngine. I found these issues while reading through the codebase to understand how DeepSpeed handles async checkpointing. The missing timeouts and use of assert for validation stood out as potential production risks, so I figured I'd submit a fix while I was in there.

@sfc-gh-truwase
Copy link
Collaborator

understand how DeepSpeed handles async checkpointing. The missing timeouts and use of assert for validation stood out as potential production risks, so I figured I'd submit a fix while I was in there.

Awesome! Thanks for being so generous.

@Rakshit-gen
Copy link
Contributor Author

Awesome! Thanks for being so generous.

Thanks! Happy to help improve the codebase.

@Rakshit-gen
Copy link
Contributor Author

is there any blocker? @sfc-gh-truwase

Add timeout and process health checks to prevent deadlocks in decoupled
checkpoint engine. This includes:
- Timeout for checkpoint commit operations (default 5 minutes)
- Process health checks while waiting for checkpoint completion
- Proper error handling in cleanup to prevent crashes
- Forceful termination if process doesn't exit gracefully

Signed-off-by: Rakshit-gen <[email protected]>
…live() and is only called after guarding with if self.ckpt_process is not None. The None case is handled separately as a valid state for ranks that don't have a checkpoint process.

Signed-off-by: Rakshit-gen <[email protected]>
…just if self.global_rank == 0:

Signed-off-by: Rakshit-gen <[email protected]>
@sfc-gh-truwase sfc-gh-truwase enabled auto-merge (squash) December 22, 2025 15:05
@sfc-gh-truwase sfc-gh-truwase merged commit ba29fdc into deepspeedai:master Dec 22, 2025
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] DecoupledCheckpointEngine hangs indefinitely due to missing timeouts and process health checks

2 participants