Skip to content

Always send shutdown_worker RPC, fix WorkerStatus state when shutting down worker#1082

Merged
yuandrew merged 6 commits intotemporalio:masterfrom
yuandrew:worker-heartbeat-shutdown-status
Jan 9, 2026
Merged

Always send shutdown_worker RPC, fix WorkerStatus state when shutting down worker#1082
yuandrew merged 6 commits intotemporalio:masterfrom
yuandrew:worker-heartbeat-shutdown-status

Conversation

@yuandrew
Copy link
Copy Markdown
Contributor

@yuandrew yuandrew commented Dec 16, 2025

What was changed

Always send shutdown_worker RPC, decouple disabling eager workflow start and worker heartbeat unregistration for worker shutdown

Why?

shutdown_worker RPC doesn't indicate that the worker has fully shutdown, only that it has started. Server and others can tell that a worker has fully shutdown by checking if there has been a heartbeat within the "heartbeat interval" after receiving the ShuttingDown status.

  1. WorkerStatus with shutdown today is not accurate. The ShutdownWorker RPC does not indicate that the worker has shutdown, but only that it has begun shutting down. So before this PR, we are today marking a worker as shutdown even though it is still in the process of shutting down.
  2. With this accurately setting status to shutdown, there is no mechanism for a worker to communicate to server that it is fully shutdown. It is up to the server to mark it as fully shutdown, using its own TTL definition. Today this is defaulted to 5 minutes, https://github.com/temporalio/temporal/blob/main/common/dynamicconfig/constants.go#L1401. This will be improved in the future to be a shorter interval, likely relating to the heartbeat interval that the worker needs to start sending. This requires an API update to add this field to the heartbeat.

Checklist

  1. Closes

  2. How was this tested:

  1. Any docs updates needed?

Note

Implements two-phase worker unregistration and ensures shutdown RPC is always sent.

  • Introduces unregister_slot_provider and finalize_unregister in ClientWorkerSet to decouple disabling eager workflow start from heartbeat unregistration; enforces order and updates all call sites (shutdown, replace_client, tests)
  • Worker shutdown now always invokes shutdown_worker with final heartbeat and sets status to ShuttingDown; tests updated to expect ShuttingDown and mocks always allow shutdown_worker
  • Removes client-side mutation of heartbeat status during shutdown RPC; heartbeat cleanup moved to end via finalize_unregister
  • Minor test/log tweaks (string formatting)

Written by Cursor Bugbot for commit 5221608. This will update automatically on new commits. Configure here.

@yuandrew yuandrew marked this pull request as ready for review December 17, 2025 19:46
@yuandrew yuandrew requested a review from a team as a code owner December 17, 2025 19:46
Comment thread crates/client/src/worker/mod.rs Outdated
.workers()
.unregister_worker(self.worker_instance_key);
.unregister_slot_provider(self.worker_instance_key);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Shutdown status not set on initiation

initiate_shutdown no longer updates self.status to WorkerStatus::ShuttingDown. Callers that use initiate_shutdown to begin shutdown (before awaiting shutdown/finalize_shutdown) will keep sending heartbeats with Running, delaying/obscuring shutdown signaling and breaking server-side “seen ShuttingDown then no heartbeat” detection.

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want ShuttingDown state to be set when we send the worker_shutdown RPC call

// This is a best effort call and we can still shutdown the worker if it fails
match self.client.shutdown_worker(sticky_name, heartbeat).await {
Err(err)
if !matches!(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Empty sticky queue sent on shutdown

shutdown() now always calls shutdown_worker and uses unwrap_or_default() for sticky_name, which becomes an empty string when no sticky queue is used (e.g., max_cached_workflows == 0 or workflow polling disabled). If the server treats an empty sticky_task_queue as invalid when implemented, this can cause noisy warnings and failed shutdown signaling.

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional, we want to start always sending shutdown_worker, not just on sticky queue

Copy link
Copy Markdown
Member

@Sushisource Sushisource left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good to me (aside from it looks like a few tests need updating), just one minor thing

Comment thread crates/client/src/worker/mod.rs Outdated
@yuandrew
Copy link
Copy Markdown
Contributor Author

yuandrew commented Jan 9, 2026

Will update the PR description with some new information

Some things to note:

  1. WorkerStatus with shutdown today is not accurate. The ShutdownWorker RPC does not indicate that the worker has shutdown, but only that it has begun shutting down. So before this PR, we are today marking a worker as shutdown even though it is still in the process of shutting down.
  2. With this accurately setting status to shutdown, there is no mechanism for a worker to communicate to server that it is fully shutdown. It is up to the server to mark it as fully shutdown, using its own TTL definition. Today this is defaulted to 5 minutes, https://github.com/temporalio/temporal/blob/main/common/dynamicconfig/constants.go#L1401. This will be improved in the future to be a shorter interval, likely relating to the heartbeat interval that the worker needs to start sending. This requires an API update to add this field to the heartbeat.

@yuandrew yuandrew merged commit d8a2bf1 into temporalio:master Jan 9, 2026
20 checks passed
@yuandrew yuandrew deleted the worker-heartbeat-shutdown-status branch January 9, 2026 21:44
yuandrew added a commit to temporalio/api that referenced this pull request Jan 23, 2026
…#686)

**What changed?**
- Added a new `worker_instance_key` field to `ShutdownWorkerRequest`, as
well as all other poll calls.
- Added comment to `ShutdownWorkerRequest.sticky_task_queue` saying it
may be blank, now that we've expanded the scope of when
`ShutdownWorkerRequest` is called.
- Added `task_queue` and `task_queue_kind` to `ShutdownWorkerRequest`

**Why?**
ShutdownWorker was changed to always be sent by SDK
(temporalio/sdk-core#1082), so sticky queue name
is now optional. This plus the new heartbeat info we send on shutdown
means Server will now have a more accurate map of which workers are
shutting down.

Adding task queue and task_queue_kind should also allow us to fix a lost
task issue, where there is a race when the SDK cancels an outstanding
poll rpc and the server decides to send a task to that poller.

Technically some of this info exists in the worker heartbeat part of the
message, but it needs to be lifted to its own field due to the scenario
where worker heartbeating is disabled.


**Breaking changes**
N/A I think, just adding new fields

**Server PR**
temporal-cicd bot pushed a commit to temporalio/api-go that referenced this pull request Jan 23, 2026
… (#686)

**What changed?**
- Added a new `worker_instance_key` field to `ShutdownWorkerRequest`, as
well as all other poll calls.
- Added comment to `ShutdownWorkerRequest.sticky_task_queue` saying it
may be blank, now that we've expanded the scope of when
`ShutdownWorkerRequest` is called.
- Added `task_queue` and `task_queue_kind` to `ShutdownWorkerRequest`

**Why?**
ShutdownWorker was changed to always be sent by SDK
(temporalio/sdk-core#1082), so sticky queue name
is now optional. This plus the new heartbeat info we send on shutdown
means Server will now have a more accurate map of which workers are
shutting down.

Adding task queue and task_queue_kind should also allow us to fix a lost
task issue, where there is a race when the SDK cancels an outstanding
poll rpc and the server decides to send a task to that poller.

Technically some of this info exists in the worker heartbeat part of the
message, but it needs to be lifted to its own field due to the scenario
where worker heartbeating is disabled.

**Breaking changes**
N/A I think, just adding new fields

**Server PR**
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants