Fix DecoupledCheckpointEngine deadlock and improve reliability #7742

Rakshit-gen · 2025-12-20T13:37:18Z

Summary

Add timeouts to prevent indefinite hangs when checkpoint process crashes
Replace assertion with proper runtime validation
Add process health checks before blocking operations
Implement graceful cleanup with escalating termination

Fixes #7741

Description

The DecoupledCheckpointEngine had several critical issues that could cause training jobs to hang indefinitely:

No timeout on save_event.wait(): If the checkpoint process died, training would hang forever waiting for an event that would never fire.
No timeout on ckpt_process.join(): If the process crashed or hung, cleanup would block indefinitely.
Assertion used for runtime validation: assert info == self.commit_info is disabled with python -O, allowing silent data corruption in production.
__del__ could hang: The destructor called cleanup() which could block, causing program hang on exit.

Changes

Add _wait_for_event_with_timeout() that checks process health every 10 seconds while waiting
Add _check_process_alive() helper to validate process state before blocking operations
Replace assert with proper if/raise ValueError for commit info validation
Add timeout to join() with escalating termination: join() → terminate() → kill()
Wrap __del__ in try/except and add _cleanup_called flag to prevent multiple cleanup calls
Add health checks in save() and commit() before queue operations

Test plan

Verify normal checkpoint save/load still works
Test behavior when checkpoint process is killed mid-save
Verify timeout triggers after 5 minutes of no response
Confirm graceful shutdown with Ctrl+C during checkpoint
Test with python -O to verify validation still works

Add timeout and process health checks to prevent deadlocks in decoupled checkpoint engine. This includes: - Timeout for checkpoint commit operations (default 5 minutes) - Process health checks while waiting for checkpoint completion - Proper error handling in cleanup to prevent crashes - Forceful termination if process doesn't exit gracefully Signed-off-by: Rakshit-gen <[email protected]>

sfc-gh-truwase · 2025-12-22T12:54:27Z

@Rakshit-gen thanks for improving this feature. Are you planning to use DecoupledCheckpointEngine?

sfc-gh-truwase · 2025-12-22T12:55:32Z

Can you also clarify if this is ready for review? I ask because of the test plan checklist.

Rakshit-gen · 2025-12-22T12:57:56Z

@sfc-gh-truwase Yes, ready for review. The checklist is for reviewer validation - I've tested basic functionality locally but would appreciate someone with a multi-GPU decoupled checkpointing setup to verify the edge cases.

Rakshit-gen · 2025-12-22T13:00:16Z

No, I'm not currently using DecoupledCheckpointEngine. I found these issues while reading through the codebase to understand how DeepSpeed handles async checkpointing. The missing timeouts and use of assert for validation stood out as potential production risks, so I figured I'd submit a fix while I was in there.

sfc-gh-truwase · 2025-12-22T13:01:30Z

understand how DeepSpeed handles async checkpointing. The missing timeouts and use of assert for validation stood out as potential production risks, so I figured I'd submit a fix while I was in there.

Awesome! Thanks for being so generous.

Rakshit-gen · 2025-12-22T13:03:17Z

Awesome! Thanks for being so generous.

Thanks! Happy to help improve the codebase.

Rakshit-gen · 2025-12-22T14:37:56Z

is there any blocker? @sfc-gh-truwase

deepspeed/runtime/checkpoint_engine/decoupled_checkpoint_engine.py

Add timeout and process health checks to prevent deadlocks in decoupled checkpoint engine. This includes: - Timeout for checkpoint commit operations (default 5 minutes) - Process health checks while waiting for checkpoint completion - Proper error handling in cleanup to prevent crashes - Forceful termination if process doesn't exit gracefully Signed-off-by: Rakshit-gen <[email protected]>

deepspeed/runtime/checkpoint_engine/decoupled_checkpoint_engine.py

…live() and is only called after guarding with if self.ckpt_process is not None. The None case is handled separately as a valid state for ranks that don't have a checkpoint process. Signed-off-by: Rakshit-gen <[email protected]>

…just if self.global_rank == 0: Signed-off-by: Rakshit-gen <[email protected]>

Rakshit-gen requested a review from tjruwase as a code owner December 20, 2025 13:37

Merge branch 'master' into fix/decoupled-checkpoint-deadlock

966df3d

Merge branch 'master' into fix/decoupled-checkpoint-deadlock

3ed642f

sfc-gh-truwase reviewed Dec 22, 2025

View reviewed changes

deepspeed/runtime/checkpoint_engine/decoupled_checkpoint_engine.py Show resolved Hide resolved

sfc-gh-truwase reviewed Dec 22, 2025

View reviewed changes

deepspeed/runtime/checkpoint_engine/decoupled_checkpoint_engine.py Outdated Show resolved Hide resolved

sfc-gh-truwase reviewed Dec 22, 2025

View reviewed changes

deepspeed/runtime/checkpoint_engine/decoupled_checkpoint_engine.py Outdated Show resolved Hide resolved

Rakshit-gen added 2 commits December 22, 2025 20:26

Added assert after line 189 and reverted the print condition back to …

af310d1

…just if self.global_rank == 0: Signed-off-by: Rakshit-gen <[email protected]>

sfc-gh-truwase approved these changes Dec 22, 2025

View reviewed changes

sfc-gh-truwase enabled auto-merge (squash) December 22, 2025 15:05

sfc-gh-truwase merged commit ba29fdc into deepspeedai:master Dec 22, 2025
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix DecoupledCheckpointEngine deadlock and improve reliability #7742

Fix DecoupledCheckpointEngine deadlock and improve reliability #7742

Uh oh!

Rakshit-gen commented Dec 20, 2025 •

edited

Loading

Uh oh!

sfc-gh-truwase commented Dec 22, 2025

Uh oh!

sfc-gh-truwase commented Dec 22, 2025

Uh oh!

Rakshit-gen commented Dec 22, 2025

Uh oh!

Rakshit-gen commented Dec 22, 2025

Uh oh!

sfc-gh-truwase commented Dec 22, 2025

Uh oh!

Rakshit-gen commented Dec 22, 2025

Uh oh!

Rakshit-gen commented Dec 22, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix DecoupledCheckpointEngine deadlock and improve reliability #7742

Fix DecoupledCheckpointEngine deadlock and improve reliability #7742

Uh oh!

Conversation

Rakshit-gen commented Dec 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Description

Changes

Test plan

Uh oh!

sfc-gh-truwase commented Dec 22, 2025

Uh oh!

sfc-gh-truwase commented Dec 22, 2025

Uh oh!

Rakshit-gen commented Dec 22, 2025

Uh oh!

Rakshit-gen commented Dec 22, 2025

Uh oh!

sfc-gh-truwase commented Dec 22, 2025

Uh oh!

Rakshit-gen commented Dec 22, 2025

Uh oh!

Rakshit-gen commented Dec 22, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Rakshit-gen commented Dec 20, 2025 •

edited

Loading