Commit c272526
[SJD] [RFC] force setting last progress time (#138615)
Summary:
Currently, if watchdog + healthcheck are enabled via knobs but watchdog is disabled via SJD config, we observe a stuck when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set since disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization
The workaround is to update the healthcheck time when calling `get_last_progress_time`
Test Plan:
Logs show that the progress time value is being changed despite client not being set
Behavior when watchdog is enabled with SJD config is left unchanged
Differential Revision: D64733766
Pull Request resolved: #138615
Approved by: https://github.com/gag1jain1 parent cdfe1bf commit c272526
1 file changed
+4
-1
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
179 | 179 | | |
180 | 180 | | |
181 | 181 | | |
| 182 | + | |
| 183 | + | |
182 | 184 | | |
183 | 185 | | |
184 | 186 | | |
| |||
249 | 251 | | |
250 | 252 | | |
251 | 253 | | |
| 254 | + | |
252 | 255 | | |
253 | 256 | | |
254 | 257 | | |
| |||
390 | 393 | | |
391 | 394 | | |
392 | 395 | | |
393 | | - | |
| 396 | + | |
0 commit comments