Fix DecoupledCheckpointEngine deadlock and improve reliability #7742
Conversation
Add timeout and process health checks to prevent deadlocks in decoupled checkpoint engine. This includes:
- Timeout for checkpoint commit operations (default 5 minutes)
- Process health checks while waiting for checkpoint completion
- Proper error handling in cleanup to prevent crashes
- Forceful termination if process doesn't exit gracefully

Signed-off-by: Rakshit-gen <[email protected]>
@Rakshit-gen thanks for improving this feature. Are you planning to use DecoupledCheckpointEngine?

Can you also clarify if this is ready for review? I ask because of the test plan checklist.

@sfc-gh-truwase Yes, ready for review. The checklist is for reviewer validation - I've tested basic functionality locally but would appreciate someone with a multi-GPU decoupled checkpointing setup to verify the edge cases.
No, I'm not currently using DecoupledCheckpointEngine. I found these issues while reading through the codebase to understand how DeepSpeed handles async checkpointing. The missing timeouts and use of assert for validation stood out as potential production risks, so I figured I'd submit a fix while I was in there. |
Awesome! Thanks for being so generous. |
Thanks! Happy to help improve the codebase. |
Is there any blocker? @sfc-gh-truwase
…live() and is only called after guarding with `if self.ckpt_process is not None`. The None case is handled separately as a valid state for ranks that don't have a checkpoint process.
Signed-off-by: Rakshit-gen <[email protected]>
…just `if self.global_rank == 0:`
Signed-off-by: Rakshit-gen <[email protected]>
Summary
Fixes #7741
Description
The `DecoupledCheckpointEngine` had several critical issues that could cause training jobs to hang indefinitely (a minimal sketch of the failure mode follows this list):
- No timeout on `save_event.wait()`: If the checkpoint process died, training would hang forever waiting for an event that would never fire.
- No timeout on `ckpt_process.join()`: If the process crashed or hung, cleanup would block indefinitely.
- Assertion used for runtime validation: `assert info == self.commit_info` is disabled with `python -O`, allowing silent data corruption in production.
- `__del__` could hang: The destructor called `cleanup()`, which could block, causing the program to hang on exit.
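To make the first two issues concrete, here is a minimal, self-contained sketch of the failure mode. This is not the DeepSpeed code; `crashing_writer` and the variable names are illustrative:

```python
import multiprocessing as mp


def crashing_writer(save_event):
    # Simulate the checkpoint writer dying before it can signal completion.
    raise RuntimeError("writer crashed before setting the event")


if __name__ == "__main__":
    save_event = mp.Event()
    proc = mp.Process(target=crashing_writer, args=(save_event,))
    proc.start()
    proc.join()  # the child exits with a non-zero exit code

    # A bare save_event.wait() here would block forever. Waiting with a
    # timeout and then checking the child's health surfaces the failure.
    if not save_event.wait(timeout=5) and not proc.is_alive():
        print(f"checkpoint process died (exitcode={proc.exitcode}); aborting instead of hanging")
```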
Changes
- Add `_wait_for_event_with_timeout()` that checks process health every 10 seconds while waiting
- Add `_check_process_alive()` helper to validate process state before blocking operations
- Replace `assert` with a proper `if`/`raise ValueError` for commit info validation
- Replace bare `join()` with escalating termination: `join()` → `terminate()` → `kill()`
- Wrap `__del__` in try/except and add a `_cleanup_called` flag to prevent multiple cleanup calls
- Check process health in `save()` and `commit()` before queue operations
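The sketch below pulls the items above together to show roughly how they fit. It reuses the names from the list (`_wait_for_event_with_timeout`, `_check_process_alive`, `_cleanup_called`) but is not the actual PR diff; the real implementation may differ in defaults, error types, and logging.

```python
import multiprocessing as mp
import time
from typing import Optional


class DecoupledEngineSketch:
    HEALTH_CHECK_INTERVAL = 10.0   # seconds between health checks while waiting
    COMMIT_TIMEOUT = 300.0         # overall commit timeout (5 minutes)

    def __init__(self, ckpt_process: Optional[mp.Process]):
        # ckpt_process may be None on ranks that do not own a checkpoint writer.
        self.ckpt_process = ckpt_process
        self._cleanup_called = False

    def _check_process_alive(self):
        # None is a valid state; only a process that has exited is an error.
        if self.ckpt_process is not None and not self.ckpt_process.is_alive():
            raise RuntimeError(
                f"checkpoint process exited unexpectedly "
                f"(exitcode={self.ckpt_process.exitcode})")

    def _wait_for_event_with_timeout(self, event, timeout=COMMIT_TIMEOUT):
        # Wait in short slices so a dead writer is detected within seconds
        # instead of blocking on event.wait() forever.
        deadline = time.monotonic() + timeout
        while not event.wait(timeout=self.HEALTH_CHECK_INTERVAL):
            self._check_process_alive()
            if time.monotonic() > deadline:
                raise TimeoutError(f"checkpoint commit did not complete within {timeout}s")

    def cleanup(self):
        if self._cleanup_called or self.ckpt_process is None:
            return
        self._cleanup_called = True
        # Escalating termination: join -> terminate -> kill.
        self.ckpt_process.join(timeout=30)
        if self.ckpt_process.is_alive():
            self.ckpt_process.terminate()
            self.ckpt_process.join(timeout=10)
        if self.ckpt_process.is_alive():
            self.ckpt_process.kill()
            self.ckpt_process.join()

    def __del__(self):
        # The destructor must never raise or block during interpreter shutdown.
        try:
            self.cleanup()
        except Exception:
            pass
```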
Test plan
- Run with `python -O` to verify validation still works
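As a standalone illustration of that test item, the snippet below shows why the assert-to-raise change matters under `python -O`: assertions are compiled out, while an explicit raise always runs. The variable names here are illustrative, not the engine's actual fields:

```python
import sys


def validate_commit(info, commit_info):
    # Old pattern (skipped entirely when Python runs with -O):
    #     assert info == commit_info
    # New pattern (always enforced, even under -O):
    if info != commit_info:
        raise ValueError(f"commit info mismatch: expected {commit_info!r}, got {info!r}")


if __name__ == "__main__":
    print(f"optimization level: {sys.flags.optimize}")  # 1 when launched as `python -O ...`
    try:
        validate_commit({"tag": "step_10"}, {"tag": "step_20"})
    except ValueError as err:
        print(f"caught as expected even under -O: {err}")
```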