Always send shutdown_worker RPC, fix WorkerStatus state when shutting down worker#1082
Conversation
| .workers() | ||
| .unregister_worker(self.worker_instance_key); | ||
| .unregister_slot_provider(self.worker_instance_key); | ||
| } |
There was a problem hiding this comment.
Bug: Shutdown status not set on initiation
initiate_shutdown no longer updates self.status to WorkerStatus::ShuttingDown. Callers that use initiate_shutdown to begin shutdown (before awaiting shutdown/finalize_shutdown) will keep sending heartbeats with Running, delaying/obscuring shutdown signaling and breaking server-side “seen ShuttingDown then no heartbeat” detection.
There was a problem hiding this comment.
We want ShuttingDown state to be set when we send the worker_shutdown RPC call
| // This is a best effort call and we can still shutdown the worker if it fails | ||
| match self.client.shutdown_worker(sticky_name, heartbeat).await { | ||
| Err(err) | ||
| if !matches!( |
There was a problem hiding this comment.
Bug: Empty sticky queue sent on shutdown
shutdown() now always calls shutdown_worker and uses unwrap_or_default() for sticky_name, which becomes an empty string when no sticky queue is used (e.g., max_cached_workflows == 0 or workflow polling disabled). If the server treats an empty sticky_task_queue as invalid when implemented, this can cause noisy warnings and failed shutdown signaling.
There was a problem hiding this comment.
This is intentional, we want to start always sending shutdown_worker, not just on sticky queue
|
Will update the PR description with some new information Some things to note:
|
…#686) **What changed?** - Added a new `worker_instance_key` field to `ShutdownWorkerRequest`, as well as all other poll calls. - Added comment to `ShutdownWorkerRequest.sticky_task_queue` saying it may be blank, now that we've expanded the scope of when `ShutdownWorkerRequest` is called. - Added `task_queue` and `task_queue_kind` to `ShutdownWorkerRequest` **Why?** ShutdownWorker was changed to always be sent by SDK (temporalio/sdk-core#1082), so sticky queue name is now optional. This plus the new heartbeat info we send on shutdown means Server will now have a more accurate map of which workers are shutting down. Adding task queue and task_queue_kind should also allow us to fix a lost task issue, where there is a race when the SDK cancels an outstanding poll rpc and the server decides to send a task to that poller. Technically some of this info exists in the worker heartbeat part of the message, but it needs to be lifted to its own field due to the scenario where worker heartbeating is disabled. **Breaking changes** N/A I think, just adding new fields **Server PR**
… (#686) **What changed?** - Added a new `worker_instance_key` field to `ShutdownWorkerRequest`, as well as all other poll calls. - Added comment to `ShutdownWorkerRequest.sticky_task_queue` saying it may be blank, now that we've expanded the scope of when `ShutdownWorkerRequest` is called. - Added `task_queue` and `task_queue_kind` to `ShutdownWorkerRequest` **Why?** ShutdownWorker was changed to always be sent by SDK (temporalio/sdk-core#1082), so sticky queue name is now optional. This plus the new heartbeat info we send on shutdown means Server will now have a more accurate map of which workers are shutting down. Adding task queue and task_queue_kind should also allow us to fix a lost task issue, where there is a race when the SDK cancels an outstanding poll rpc and the server decides to send a task to that poller. Technically some of this info exists in the worker heartbeat part of the message, but it needs to be lifted to its own field due to the scenario where worker heartbeating is disabled. **Breaking changes** N/A I think, just adding new fields **Server PR**
What was changed
Always send shutdown_worker RPC, decouple disabling eager workflow start and worker heartbeat unregistration for worker shutdown
Why?
shutdown_workerRPC doesn't indicate that the worker has fully shutdown, only that it has started. Server and others can tell that a worker has fully shutdown by checking if there has been a heartbeat within the "heartbeat interval" after receiving theShuttingDownstatus.WorkerStatuswith shutdown today is not accurate. TheShutdownWorkerRPC does not indicate that the worker has shutdown, but only that it has begun shutting down. So before this PR, we are today marking a worker as shutdown even though it is still in the process of shutting down.Checklist
Closes
How was this tested:
Note
Implements two-phase worker unregistration and ensures shutdown RPC is always sent.
unregister_slot_providerandfinalize_unregisterinClientWorkerSetto decouple disabling eager workflow start from heartbeat unregistration; enforces order and updates all call sites (shutdown,replace_client, tests)shutdown_workerwith final heartbeat and sets status toShuttingDown; tests updated to expectShuttingDownand mocks always allowshutdown_workerstatusduring shutdown RPC; heartbeat cleanup moved to end viafinalize_unregisterWritten by Cursor Bugbot for commit 5221608. This will update automatically on new commits. Configure here.