Commit c272526

felixsu2006 authored and pytorchmergebot committed
[SJD] [RFC] force setting last progress time (#138615)
Summary:
Currently, if watchdog and healthcheck are enabled via knobs but watchdog is disabled via SJD config, the watchdog loop gets stuck when it attempts to open the watchdog file path. This happens because the FileTimerClient that is usually set in TorchElasticWatchdog is never set: disabling watchdog via SJD config bypasses the TorchElasticWatchdog initialization.

The workaround is to update the healthcheck time when calling `get_last_progress_time`.

Test Plan:
Logs show that the progress time value changes even though the client is not set.

Behavior when watchdog is enabled with SJD config is left unchanged.

Differential Revision: D64733766

Pull Request resolved: #138615
Approved by: https://github.com/gag1jain
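The hang described in the summary is consistent with how POSIX FIFOs behave: a blocking `open()` for reading does not return until some process opens the same path for writing. The sketch below reproduces that behavior in isolation. It is not taken from the PyTorch sources; the function name `demo_fifo_open_blocks` and the path `watchdog_pipe` are hypothetical, and it assumes a POSIX system where `os.mkfifo` is available.

```python
import os
import tempfile
import threading
from typing import Tuple


def demo_fifo_open_blocks(wait_s: float = 0.2) -> Tuple[bool, bool]:
    """Return (opened_before_writer, opened_after_writer) for a reader thread."""
    tmp_dir = tempfile.mkdtemp()
    fifo_path = os.path.join(tmp_dir, "watchdog_pipe")  # hypothetical path
    os.mkfifo(fifo_path)
    opened = threading.Event()

    def reader() -> None:
        # open() for reading blocks until a writer opens the FIFO.
        with open(fifo_path) as fd:
            opened.set()
            fd.read()  # returns at EOF, once the writer closes its end

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    # No writer exists yet, so the reader should still be stuck in open().
    before = opened.wait(timeout=wait_s)

    # Opening the write end unblocks the reader's open() call.
    with open(fifo_path, "w"):
        after = opened.wait(timeout=5.0)

    thread.join(timeout=5.0)
    os.remove(fifo_path)
    os.rmdir(tmp_dir)
    return before, after
```

If no client ever opens the write end, the reader thread stays inside `open()` indefinitely, which matches the situation in which no FileTimerClient is set.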
1 parent cdfe1bf commit c272526

1 file changed, +4 -1 lines changed

torch/distributed/elastic/timer/file_based_local_timer.py

Lines changed: 4 additions & 1 deletion
@@ -179,6 +179,8 @@ def __init__(
         self._timers: Dict[Tuple[int, str], FileTimerRequest] = {}
         self._stop_signaled = False
         self._watchdog_thread: Optional[threading.Thread] = None
+
+        self._is_client_started = False
         if os.path.exists(self._file_path):
             os.remove(self._file_path)
         os.mkfifo(self._file_path)
@@ -249,6 +251,7 @@ def _watchdog_loop(self) -> None:
         # 2. We are running the watchdog loop in a separate daemon
         # thread, which will not block the process to stop.
         with open(self._file_path) as fd:
+            self._is_client_started = True
             while not self._stop_signaled:
                 try:
                     run_once = self._run_once
@@ -390,4 +393,4 @@ def _reap_worker(self, worker_pid: int, signal: int) -> bool:
         return False
 
     def get_last_progress_time(self) -> int:
-        return self._last_progress_time
+        return self._last_progress_time if self._is_client_started else int(time.time())
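The effect of the last hunk can be illustrated with a small stand-in class. This is a minimal sketch of the fallback only, not the real server class: `ProgressTrackerSketch` and `mark_client_started` are hypothetical names, and the initial value given to `_last_progress_time` is an assumption made for illustration.

```python
import time


class ProgressTrackerSketch:
    """Hypothetical stand-in that mirrors only the fallback in this commit."""

    def __init__(self) -> None:
        # Assumed initial value; the real initialization is not part of this diff.
        self._last_progress_time = int(time.time())
        self._is_client_started = False

    def mark_client_started(self) -> None:
        # In the commit, the flag is set once open() on the FIFO returns.
        self._is_client_started = True

    def get_last_progress_time(self) -> int:
        # Until a client has connected, report the current time so that a
        # caller polling this value does not see a stale timestamp.
        if self._is_client_started:
            return self._last_progress_time
        return int(time.time())
```

Before the flag is set, every call reports a fresh timestamp; once it is set, the stored value is returned unchanged, so behavior with a connected client stays as it was.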
