Conversation

@DrJosh9000 DrJosh9000 commented Oct 16, 2025

Description

Various changes related to signals, signal handling, and grace periods.

Context

https://linear.app/buildkite/issue/PS-1220

https://linear.app/buildkite/issue/PS-1069

Changes

Signal vs cancel grace period

When a job is cancelled (either through the UI, or when the agent is ctrl-C-ed twice), the timeline is meant to go like this (I think):

  • t0: Cancellation begins. The job runner is told to Cancel, which sends an Interrupt to the job "process": either an actual process (the bootstrap running as a proper subprocess of the agent) or the Kubernetes runner (the bootstrap running in another container in the pod).
    • When it's a subprocess, this is a SIGTERM to the process group of the subprocess.
    • When it's Kubernetes, it changes the state to RunStateInterrupt, which the client side should notice on its next poll.
    • Either way, the bootstrap should handle the SIGTERM and wrap itself up within signalGracePeriod.
  • t0 + x < signalGracePeriod: Happy path. The subprocess/other container has wrapped up and we can mark the job finished (and disconnect if the agent is stopping).
  • t0 + signalGracePeriod: Not happy path. Now we Terminate (send a SIGKILL) to the subprocess. There's no equivalent for Kubernetes, since we can't SIGKILL something in a different PID namespace, but we can pretend it has died and move on.
  • t0 + (signalGracePeriod + x) < cancelGracePeriod: Bittersweet path - the subprocess was killed, but we uploaded the logs and marked it finished.
  • t0 + cancelGracePeriod: Bad end - we need to wrap it up now. There's no more time to finish gracefully. Go to the next job if not stopping, or exit the agent if stopping.

The first problem is that the job runner's Cancel implementation diverged from the process package's: it sent SIGKILL after cancelGracePeriod, not after signalGracePeriod. So I changed it to wait for signalGracePeriod instead.
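
To make the intended sequence concrete, here is a minimal sketch of that timeline in Go. It's illustrative only: jobProc, Interrupt, Terminate and Done are stand-ins, not the agent's real types or method names.

```go
package sketch

import (
	"context"
	"time"
)

// jobProc is a hypothetical stand-in for the job "process" abstraction.
type jobProc interface {
	Interrupt() error      // SIGTERM the process group, or set RunStateInterrupt on k8s
	Terminate() error      // SIGKILL the process group (no k8s equivalent)
	Done() <-chan struct{} // closed when the job has exited
}

// cancelJob follows the intended timeline: Interrupt at t0, wait up to
// signalGracePeriod for the job to wrap itself up, then Terminate. (The bug
// was waiting cancelGracePeriod here instead of signalGracePeriod.)
func cancelJob(ctx context.Context, p jobProc, signalGracePeriod time.Duration) error {
	if err := p.Interrupt(); err != nil { // t0: cancellation begins
		return err
	}
	select {
	case <-p.Done(): // happy path: wrapped up within the grace period
		return nil
	case <-time.After(signalGracePeriod): // not happy path: force it
		return p.Terminate()
	case <-ctx.Done():
		return ctx.Err()
	}
}
```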

Kubernetes

Kubernetes has the concept of terminationGracePeriodSeconds, which implements the same thing as signalGracePeriod above, but it happens when the pod is deleted. It also sends SIGTERM to every container in the pod.

Currently, on SIGTERM and other signals, the kubernetes-bootstrap cancels the subprocess context, which also runs through the Interrupt-signalGracePeriod-Terminate dance. Does this mean the bootstrap receives SIGTERM twice? If so, that's a problem: the bootstrap signal handler removes itself after the first signal, so the second signal falls back to Go's default signal handling, which kills the process before the grace period expires.

We need to listen to one or the other, and I'm choosing to favour the signal passed over the agent socket. Thus SIGTERM and friends will be swallowed by kubernetes-bootstrap.
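
Roughly what "swallowing" looks like, as a sketch; the real kubernetes-bootstrap wiring differs, and socketSignals here is an assumed channel fed by the agent-socket client:

```go
package sketch

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// swallowDirectSignals ignores SIGTERM/SIGINT delivered straight to the
// bootstrap (e.g. by Kubernetes deleting the pod) and acts only on the
// signal relayed over the agent socket, so the cancellation dance runs once.
func swallowDirectSignals(socketSignals <-chan os.Signal, cancel func(os.Signal)) {
	direct := make(chan os.Signal, 1)
	signal.Notify(direct, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		for s := range direct {
			log.Printf("ignoring direct %v; waiting for the agent socket", s)
		}
	}()
	cancel(<-socketSignals) // Interrupt → signalGracePeriod → Terminate, exactly once
}
```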

Ctrl-C

The double-Ctrl-C behaviour is useful when running the agent in a local terminal. But the code path in the start signal handler is shared between SIGINT (the default Ctrl-C signal) and SIGTERM.

This means that the Kubernetes-delivered SIGTERM causes the agent to enter graceful exit mode, waiting for the job to complete. So the agent container never starts ungracefully cancelling the job. We're back to 1 SIGTERM for the bootstrap container!

Thus, in Kubernetes mode, I'm choosing to go straight to ungraceful cancellation on SIGINT/SIGTERM.

Bonus 1: I've implemented a third Ctrl-C to really exit the agent immediately.
Bonus 2: I set up a struct to pass the configuration for the signal handler rather than add another arg.
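
As a sketch of the escalation (the actual handler and the Bonus 2 config struct live in the agent start command; all names here are illustrative):

```go
package sketch

import (
	"os"
	"os/signal"
	"syscall"
)

// signalHandlerConfig is an illustrative stand-in for the Bonus 2 struct:
// configuration passed to the handler instead of yet another argument.
type signalHandlerConfig struct {
	KubernetesMode     bool   // skip the graceful step entirely
	StopGracefully     func() // finish the current job, then exit
	CancelUngracefully func() // cancel the running job now
	ExitImmediately    func() // bail out without waiting
}

func handleStopSignals(cfg signalHandlerConfig) {
	sigs := make(chan os.Signal, 3)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		count := 0
		for range sigs {
			count++
			switch {
			case count == 1 && !cfg.KubernetesMode:
				cfg.StopGracefully() // first Ctrl-C: wait for the job
			case count <= 2:
				cfg.CancelUngracefully() // k8s SIGTERM, or second Ctrl-C
			default:
				cfg.ExitImmediately() // third Ctrl-C: really exit now
			}
		}
	}()
}
```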

Graceful contexts

But back to the current implementation of the start signal handler. On the second SIGINT/SIGTERM it does this:

  1. Create a new graceful context whose parent is ctx and which is cancelled 1 cancel grace period after ctx is done.
  2. Asynchronously:
    1. send the worker pool a Stop(false),
    2. sleep half the cancel grace period,
    3. then call the cancel function for ctx.
  3. Wait for the graceful context to be done, then exit 1.

I recall that it seemed reasonable at the time, but the effect is that halfway through the cancel grace period, which is usually only 1 second more than the signal grace period, the agent loses the ability to upload logs or finish the job.

It's a complicated but effective footgun. I took it out and replaced it with a much simpler dead man's switch that waits the full cancel grace period before exiting.
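
A minimal sketch of that dead man's switch, assuming a done channel that's closed once the agent has finished wrapping up:

```go
package sketch

import (
	"os"
	"time"
)

// armDeadMansSwitch gives the agent the full cancel grace period to upload
// logs and finish the job; only once that expires does it force an exit.
func armDeadMansSwitch(cancelGracePeriod time.Duration, done <-chan struct{}) {
	go func() {
		select {
		case <-done: // wrapped up in time; nothing to do
		case <-time.After(cancelGracePeriod):
			os.Exit(1) // out of time: bad end
		}
	}()
}
```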

What about KubernetesLogCollectionGracePeriod?

Good question. I think #3500 didn't actually change what it intended to change.

KubernetesLogCollectionGracePeriod changes the time waited in the function waitForKubernetesProcessesToComplete, which is called in the job runner's cleanup. cleanup is deferred in the job runner's Run, which (to cut a long story short) only returns, and thus runs its defers, after the "process" has already exited. So it should always see that the process has finished, and never wait.

What we probably need to fix in the k8s case is the default signalGracePeriod (probably via the 1-second-before-cancelGracePeriod computation). If the default terminationGracePeriodSeconds is 60, then we should consider setting BUILDKITE_CANCEL_GRACE_PERIOD to around 60 too. This will mean changing agent-stack-k8s. But here, I've just ripped out the bulk of #3500, leaving the flag for backwards compatibility.
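
For the record, the relationship I mean by the "1-second-before" computation, roughly (values and the function are illustrative, not a proposed default):

```go
package sketch

import "time"

// defaultSignalGracePeriod sketches the 1-second-before-cancelGracePeriod
// computation. With BUILDKITE_CANCEL_GRACE_PERIOD raised to ~60s to match a
// terminationGracePeriodSeconds of 60, the signal grace period follows along.
func defaultSignalGracePeriod(cancelGracePeriod time.Duration) time.Duration {
	return cancelGracePeriod - 1*time.Second
}
```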

Bootstrap self-signaling

When the bootstrap receives one of the signals it handles in a special way, it registers a cancellation and removes the custom signal handler. Once the cancellation is complete, it re-signals itself with the same signal so that the default signal handler deals with it. The expected behaviour is that the process exits with status 128+signal. But Go's default behaviour on SIGQUIT is to dump goroutine stacks (much like a panic) and exit with status 2. Since there seems to be no way to "exit due to SIGQUIT, but without the stack dump" in Go, I've turned it into a manual "exit 131".
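
Roughly what the re-signalling step and the SIGQUIT special case look like, as a sketch (Unix-only, and not the exact code):

```go
package sketch

import (
	"os"
	"os/signal"
	"syscall"
)

// exitForSignal re-raises the handled signal so the process exits with
// status 128+signal. SIGQUIT is special-cased: Go's runtime would dump all
// goroutine stacks and exit 2, so we exit 131 (128+3) by hand instead.
func exitForSignal(sig syscall.Signal) {
	if sig == syscall.SIGQUIT {
		os.Exit(131)
	}
	signal.Reset(sig)                       // restore the default disposition
	_ = syscall.Kill(syscall.Getpid(), sig) // re-raise; should not return
	os.Exit(128 + int(sig))                 // fallback, just in case it does
}
```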

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go fmt ./...)

Disclosures / Credits

I did not use AI tools at all

@DrJosh9000 DrJosh9000 force-pushed the ps-1220-fix-signal-handling branch 2 times, most recently from b9c3bbb to a301f55 on October 16, 2025 07:09
@DrJosh9000 DrJosh9000 marked this pull request as ready for review October 16, 2025 07:10
@DrJosh9000 DrJosh9000 requested a review from a team October 16, 2025 07:10
@DrJosh9000 DrJosh9000 force-pushed the ps-1220-fix-signal-handling branch 2 times, most recently from 1f3254f to 40e8ae3 on October 20, 2025 04:03

@wolfeidau wolfeidau left a comment

Some observations, not sure how helpful they are.

@DrJosh9000 DrJosh9000 force-pushed the ps-1220-fix-signal-handling branch 3 times, most recently from 0e01d98 to cc361b4 on October 20, 2025 05:47
@DrJosh9000 DrJosh9000 force-pushed the ps-1220-fix-signal-handling branch from cc361b4 to d0b0e8c on October 20, 2025 05:58
@DrJosh9000 DrJosh9000 requested a review from wolfeidau October 20, 2025 06:40

@wolfeidau wolfeidau left a comment

👍🏻 🚀

@DrJosh9000 DrJosh9000 merged commit 131f396 into main Oct 20, 2025
1 check passed
@DrJosh9000 DrJosh9000 deleted the ps-1220-fix-signal-handling branch October 20, 2025 23:07