-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[SJD] [RFC] force setting last progress time #138615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This appears to be a diff that was exported from phabricator, but the PR author does not have sufficient permissions to run CI. @felixsu2006, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the Pytorch Dev Infra team. |
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138615
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ No FailuresAs of commit 9e3818d with merge base cc93c1e ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
This pull request was exported from Phabricator. Differential Revision: D64733766 |
Summary: Currently, if watchdog + healthcheck are enabled via knobs but watchdog is disabled via SJD config, we observe a stuck when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set since disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization The workaround is to update the healthcheck time when calling `get_last_progress_time` Test Plan: E2E test logs show that the progress time value is being changed despite client not being set Behavior when watchdog is enabled with SJD config is left unchanged Differential Revision: D64733766
7372c2b to
9e3818d
Compare
|
This pull request was exported from Phabricator. Differential Revision: D64733766 |
1 similar comment
|
This pull request was exported from Phabricator. Differential Revision: D64733766 |
|
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged) |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Summary: Currently, if watchdog + healthcheck are enabled via knobs but watchdog is disabled via SJD config, we observe a stuck when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set since disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization The workaround is to update the healthcheck time when calling `get_last_progress_time` Test Plan: Logs show that the progress time value is being changed despite client not being set Behavior when watchdog is enabled with SJD config is left unchanged Differential Revision: D64733766 Pull Request resolved: #138615 Approved by: https://github.com/gag1jain
Summary:
Currently, if watchdog + healthcheck are enabled via knobs but watchdog is disabled via SJD config, we observe a stuck when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set since disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization
The workaround is to update the healthcheck time when calling
get_last_progress_timeTest Plan:
Logs show that the progress time value is being changed despite client not being set
Behavior when watchdog is enabled with SJD config is left unchanged
Differential Revision: D64733766
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o