Review of signal handling and grace periods #3534
Description
Various changes related to signals, signal handling, and grace periods.
Context
https://linear.app/buildkite/issue/PS-1220
https://linear.app/buildkite/issue/PS-1069
Changes
Signal vs cancel grace period
When a job is cancelled (either through the UI, or when the agent is ctrl-C-ed twice), the timeline is meant to go like this (I think):

- `t0`: Cancellation begins. The job runner is told to `Cancel`, which sends an `Interrupt` to the job "process", which is either a `process` (and the bootstrap is in a proper subprocess of the agent) or the Kubernetes runner (the bootstrap is in another container in the pod).
  - `process` sends `SIGTERM` to the process group of the subprocess.
  - The Kubernetes runner sets the run state to `RunStateInterrupt`, which the client side should notice in the next poll.
  - Either way, the bootstrap should handle a `SIGTERM` and wrap itself up within `signalGracePeriod`.
- `t0 + x < signalGracePeriod`: Happy path. The subprocess/other container has wrapped up and we can mark the job finished (and disconnect if the agent is stopping).
- `t0 + signalGracePeriod`: Not happy path. Now we `Terminate` (send a `SIGKILL`) to the subprocess. There's no equivalent for Kubernetes, since we can't `SIGKILL` something in a different PID namespace, but we can pretend it has died and move on.
- `t0 + (signalGracePeriod + x) < cancelGracePeriod`: Bittersweet path - the subprocess was killed, but we uploaded the logs and marked it finished.
- `t0 + cancelGracePeriod`: Bad end - we need to wrap it up now. There's no more time to finish gracefully. Go to the next job if not stopping, or exit the agent if stopping.

The first problem is that the job runner's `Cancel` implementation diverged from `process`, and sent `SIGKILL` after `cancelGracePeriod`, not after `signalGracePeriod`. So I changed it to wait for `signalGracePeriod` instead.
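For illustration, the intended dance for the subprocess case looks roughly like this. This is a minimal, Unix-only sketch and not the agent's actual code; `cancelJob` and its parameters are illustrative names, and it assumes the subprocess was started in its own process group.

```go
package sketch

import (
	"os/exec"
	"syscall"
	"time"
)

// cancelJob sketches the Interrupt -> signalGracePeriod -> Terminate
// ordering: SIGTERM the process group, wait up to signalGracePeriod for
// the bootstrap to wrap itself up, then SIGKILL the group.
func cancelJob(cmd *exec.Cmd, signalGracePeriod time.Duration) error {
	// Interrupt: signal the whole process group (negative PID).
	if err := syscall.Kill(-cmd.Process.Pid, syscall.SIGTERM); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		// Happy path: it exited within the signal grace period.
		return err
	case <-time.After(signalGracePeriod):
		// Not happy path: Terminate (SIGKILL) the process group.
		_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
		return <-done
	}
}
```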
Kubernetes

Kubernetes has the concept of `terminationGracePeriodSeconds`, which implements the same thing as `signalGracePeriod` above, but it happens when the pod is deleted. It also sends `SIGTERM` to every container in the pod.

Currently, on `SIGTERM` and others, the `kubernetes-bootstrap` cancels the subprocess context, which also runs through the `Interrupt`-`signalGracePeriod`-`Terminate` dance. Does this mean the bootstrap receives `SIGTERM` twice? If so, the `bootstrap` signal handler removes itself after the first signal, causing the second signal to be handled according to the normal Go signal handling, which kills it before the grace period expires.

We need to listen to one or the other, and I'm choosing to favour the signal passed over the agent socket. Thus `SIGTERM` and friends will be swallowed by `kubernetes-bootstrap`.
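Conceptually, the swallowing could look something like this. A minimal sketch, not the actual `kubernetes-bootstrap` implementation; the agent-socket side that actually drives cancellation is not shown.

```go
package sketch

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

// swallowPodSignals keeps SIGTERM and friends from terminating the
// process via Go's default handling. Cancellation is expected to arrive
// over the agent socket instead (not shown).
func swallowPodSignals(ctx context.Context) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT, syscall.SIGQUIT)
	go func() {
		for {
			select {
			case sig := <-sigs:
				log.Printf("ignoring %v; waiting for cancellation over the agent socket", sig)
			case <-ctx.Done():
				signal.Stop(sigs)
				return
			}
		}
	}()
}
```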
Ctrl-C

The double-Ctrl-C behaviour is useful when running the agent in a local terminal. But the code path in the `start` signal handler is shared between `SIGINT` (the default Ctrl-C signal) and `SIGTERM`.

This means that the Kubernetes-delivered `SIGTERM` causes the agent to enter graceful exit mode, waiting for the job to complete. So the `agent` container never starts ungracefully cancelling the job. We're back to one `SIGTERM` for the `bootstrap` container!

Thus in Kubernetes mode, I'm choosing to go straight to ungraceful cancellation on `SIGINT`/`SIGTERM`.

Bonus 1: I've implemented a third Ctrl-C to really exit the agent immediately.

Bonus 2: I set up a struct to pass the configuration for the signal handler rather than add another arg.
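Putting that together, the intended `SIGINT`/`SIGTERM` counting behaviour is roughly this. A minimal sketch only; `HandlerConfig` and its fields are illustrative names, not the agent's, and the Kubernetes case (skip straight to ungraceful cancellation) is noted in a comment rather than implemented.

```go
package sketch

import (
	"os"
	"os/signal"
	"syscall"
)

// HandlerConfig is a hypothetical stand-in for the config struct
// mentioned in Bonus 2.
type HandlerConfig struct {
	GracefulStop   func() // finish the current job, then stop
	UngracefulStop func() // cancel the running job now
}

// handleStopSignals counts SIGINT/SIGTERM deliveries. In Kubernetes mode
// the graceful first step would be skipped entirely.
func handleStopSignals(cfg HandlerConfig) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		count := 0
		for range sigs {
			count++
			switch count {
			case 1:
				cfg.GracefulStop()
			case 2:
				cfg.UngracefulStop()
			default:
				os.Exit(1) // third Ctrl-C: really exit immediately
			}
		}
	}()
}
```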
Graceful contexts
But back to the `start` signal handler's current implementation. On the second `SIGINT`/`SIGTERM` it does this:

- creates a "graceful" context that outlives `ctx` and is cancelled 1 cancellation grace period after `ctx` is done
- calls `Stop(false)`
- cancels `ctx`

I recall that it seemed reasonable at the time, but the effect is that halfway through the cancel grace period, which is usually only 1 second more than the signal grace period, the agent loses the ability to upload logs or finish the job.

It's a complicated but effective footgun. I took it out and replaced it with a much simpler dead man's switch that waits the full cancel grace period before exiting.
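The dead man's switch is conceptually just this (a minimal sketch of the idea, not the actual implementation):

```go
package sketch

import (
	"context"
	"os"
	"time"
)

// armDeadMansSwitch exits the process one full cancel grace period after
// ctx is cancelled, instead of swapping ctx for a second "graceful"
// context partway through shutdown.
func armDeadMansSwitch(ctx context.Context, cancelGracePeriod time.Duration) {
	go func() {
		<-ctx.Done()
		time.Sleep(cancelGracePeriod)
		// Still here? Graceful shutdown has overstayed its welcome.
		os.Exit(1)
	}()
}
```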
What about `KubernetesLogCollectionGracePeriod`?

Good question. I think #3500 didn't actually change what it intended to change. `KubernetesLogCollectionGracePeriod` changes the time waited in the function `waitForKubernetesProcessesToComplete`, which is called in the job runner's `cleanup`. `cleanup` is deferred in the job runner's `Run`, which, to cut a long story short, only starts returning (and running defers) after the "process" has already exited. So it should always see the process has finished, and never wait.

What we probably need to fix in the k8s case is the default `signalGracePeriod` (probably via the 1-second-before-`cancelGracePeriod` computation). If the default `terminationGracePeriodSeconds` is 60, then we should consider setting `BUILDKITE_CANCEL_GRACE_PERIOD` to around 60 too. This will mean changing agent-stack-k8s. But here, I've just ripped out the bulk of #3500, leaving the flag for backwards compatibility.
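For reference, the relationship being described is just this (illustrative sketch only, not the agent's code):

```go
package sketch

import "time"

// The "1 second before cancelGracePeriod" computation: if
// BUILDKITE_CANCEL_GRACE_PERIOD were raised to ~60s to match
// terminationGracePeriodSeconds, the signal grace period would follow it
// up to ~59s.
func signalGracePeriodFor(cancelGracePeriod time.Duration) time.Duration {
	return cancelGracePeriod - 1*time.Second
}
```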
Bootstrap self-signaling

When the bootstrap receives one of the signals it handles in a special way, it registers a cancellation and removes the custom signal handler, then once the cancellation is complete, it re-signals itself so that Go's default handling deals with the signal. The expected behaviour is that it exits with status 128+signal. Go's default behaviour on `SIGQUIT` is to panic and exit with status 2. Since there seems to be no way to "exit due to `SIGQUIT` but without a panic" in Go, I've turned it into a manual "exit 131".
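The self-signalling step is along these lines (a minimal sketch, not the bootstrap's actual code):

```go
package sketch

import (
	"os"
	"os/signal"
	"syscall"
)

// exitLikeSignal removes the custom handler and re-raises sig so the
// process exits with status 128+signal, except for SIGQUIT, whose default
// Go behaviour is a goroutine dump and exit status 2, so it becomes a
// manual exit 131 instead.
func exitLikeSignal(sig syscall.Signal) {
	signal.Reset(sig)
	if sig == syscall.SIGQUIT {
		os.Exit(131) // 128 + SIGQUIT(3), without the dump
	}
	_ = syscall.Kill(syscall.Getpid(), sig)
	// If the signal somehow doesn't terminate us, fall back to a plain exit.
	os.Exit(128 + int(sig))
}
```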
Testing
- Tests have run locally (`go test ./...`). Buildkite employees may check this if the pipeline has run automatically.
- Code is formatted (`go fmt ./...`)

Disclosures / Credits
I did not use AI tools at all