Review of signal handling and grace periods #3534
Description
Various changes related to signals, signal handling, and grace periods.
Context
https://linear.app/buildkite/issue/PS-1220
https://linear.app/buildkite/issue/PS-1069
Changes
Signal vs cancel grace period
When a job is cancelled (either through the UI, or when the agent is ctrl-C-ed twice), the timeline is meant to go like this (I think):

- `t0`: Cancellation begins. The job runner is told to `Cancel`, which sends an `Interrupt` to the job "process", which is either a `process` (and the bootstrap is in a proper subprocess of the agent) or the Kubernetes runner (the bootstrap is in another container in the pod).
  - `process` sends `SIGTERM` to the process group of the subprocess.
  - The Kubernetes runner sets the run state to `RunStateInterrupt`, which the client side should notice in the next poll.
  - Either way, the bootstrap should handle a `SIGTERM` and wrap itself up within `signalGracePeriod`.
- `t0 + x < signalGracePeriod`: Happy path. The subprocess/other container has wrapped up and we can mark the job finished (and disconnect if the agent is stopping).
- `t0 + signalGracePeriod`: Not happy path. Now we `Terminate` (send a `SIGKILL`) to the subprocess. There's no equivalent for Kubernetes, since we can't `SIGKILL` something in a different PID namespace, but we can pretend it has died and move on.
- `t0 + (signalGracePeriod + x) < cancelGracePeriod`: Bittersweet path - the subprocess was killed, but we uploaded the logs and marked it finished.
- `t0 + cancelGracePeriod`: Bad end - we need to wrap it up now. There's no more time to finish gracefully. Go to the next job if not stopping, or exit the agent if stopping.

The first problem is that the job runner's `Cancel` implementation diverged from `process`, and sent `SIGKILL` after `cancelGracePeriod`, not after `signalGracePeriod`. So I changed it to wait for `signalGracePeriod` instead.
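For illustration, the intended dance for the subprocess case looks roughly like this. This is a minimal, Unix-only sketch and not the agent's actual code; `cancelJob` and its parameters are illustrative names, and it assumes the subprocess was started in its own process group.

```go
package sketch

import (
	"os/exec"
	"syscall"
	"time"
)

// cancelJob sketches the Interrupt -> signalGracePeriod -> Terminate
// ordering: SIGTERM the process group, wait up to signalGracePeriod for
// the bootstrap to wrap itself up, then SIGKILL the group.
func cancelJob(cmd *exec.Cmd, signalGracePeriod time.Duration) error {
	// Interrupt: signal the whole process group (negative PID).
	if err := syscall.Kill(-cmd.Process.Pid, syscall.SIGTERM); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		// Happy path: it exited within the signal grace period.
		return err
	case <-time.After(signalGracePeriod):
		// Not happy path: Terminate (SIGKILL) the process group.
		_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
		return <-done
	}
}
```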
Kubernetes

Kubernetes has the concept of `terminationGracePeriodSeconds`, which implements the same thing as `signalGracePeriod` above, but it happens when the pod is deleted. It also sends `SIGTERM` to every container in the pod.

Currently, on `SIGTERM` and others, the `kubernetes-bootstrap` cancels the subprocess context, which also runs through the `Interrupt`-`signalGracePeriod`-`Terminate` dance. Does this mean the bootstrap receives `SIGTERM` twice? If so, the `bootstrap` signal handler removes itself after the first signal, causing the second signal to be handled according to the normal Go signal handling, which kills it before the grace period expires.

We need to listen to one or the other, and I'm choosing to favour the signal passed over the agent socket. Thus `SIGTERM` and friends will be swallowed by `kubernetes-bootstrap`.
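Conceptually, the swallowing could look something like this. A minimal sketch, not the actual `kubernetes-bootstrap` implementation; the agent-socket side that actually drives cancellation is not shown.

```go
package sketch

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

// swallowPodSignals keeps SIGTERM and friends from terminating the
// process via Go's default handling. Cancellation is expected to arrive
// over the agent socket instead (not shown).
func swallowPodSignals(ctx context.Context) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT, syscall.SIGQUIT)
	go func() {
		for {
			select {
			case sig := <-sigs:
				log.Printf("ignoring %v; waiting for cancellation over the agent socket", sig)
			case <-ctx.Done():
				signal.Stop(sigs)
				return
			}
		}
	}()
}
```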
Ctrl-C

The double-Ctrl-C behaviour is useful when running the agent in a local terminal. But the code path in the `start` signal handler is shared between `SIGINT` (the default Ctrl-C signal) and `SIGTERM`.

This means that the Kubernetes-delivered `SIGTERM` causes the agent to enter graceful exit mode, waiting for the job to complete. So the `agent` container never starts ungracefully cancelling the job. We're back to one `SIGTERM` for the `bootstrap` container!

Thus in Kubernetes mode, I'm choosing to go straight to ungraceful cancellation on `SIGINT`/`SIGTERM`.

Bonus 1: I've implemented a third Ctrl-C to really exit the agent immediately.

Bonus 2: I set up a struct to pass the configuration for the signal handler rather than add another arg.
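Putting that together, the intended `SIGINT`/`SIGTERM` counting behaviour is roughly this. A minimal sketch only; `HandlerConfig` and its fields are illustrative names, not the agent's, and the Kubernetes case (skip straight to ungraceful cancellation) is noted in a comment rather than implemented.

```go
package sketch

import (
	"os"
	"os/signal"
	"syscall"
)

// HandlerConfig is a hypothetical stand-in for the config struct
// mentioned in Bonus 2.
type HandlerConfig struct {
	GracefulStop   func() // finish the current job, then stop
	UngracefulStop func() // cancel the running job now
}

// handleStopSignals counts SIGINT/SIGTERM deliveries. In Kubernetes mode
// the graceful first step would be skipped entirely.
func handleStopSignals(cfg HandlerConfig) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		count := 0
		for range sigs {
			count++
			switch count {
			case 1:
				cfg.GracefulStop()
			case 2:
				cfg.UngracefulStop()
			default:
				os.Exit(1) // third Ctrl-C: really exit immediately
			}
		}
	}()
}
```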
Graceful contexts
But back to the `start` signal handler's current implementation. On the second `SIGINT`/`SIGTERM` it does this:

- creates a "graceful" context that outlives `ctx` and is cancelled 1 cancellation grace period after `ctx` is done
- calls `Stop(false)`
- cancels `ctx`

I recall that it seemed reasonable at the time, but the effect is that halfway through the cancel grace period, which is usually only 1 second more than the signal grace period, the agent loses the ability to upload logs or finish the job.

It's a complicated but effective footgun. I took it out and replaced it with a much simpler dead man's switch that waits the full cancel grace period before exiting.
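The dead man's switch is conceptually just this (a minimal sketch of the idea, not the actual implementation):

```go
package sketch

import (
	"context"
	"os"
	"time"
)

// armDeadMansSwitch exits the process one full cancel grace period after
// ctx is cancelled, instead of swapping ctx for a second "graceful"
// context partway through shutdown.
func armDeadMansSwitch(ctx context.Context, cancelGracePeriod time.Duration) {
	go func() {
		<-ctx.Done()
		time.Sleep(cancelGracePeriod)
		// Still here? Graceful shutdown has overstayed its welcome.
		os.Exit(1)
	}()
}
```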
What about `KubernetesLogCollectionGracePeriod`?

Good question. I think #3500 didn't actually change what it intended to change. `KubernetesLogCollectionGracePeriod` changes the time waited in the function `waitForKubernetesProcessesToComplete`, which is called in the job runner's `cleanup`. `cleanup` is deferred in the job runner's `Run`, which, to cut a long story short, only starts returning (and running defers) after the "process" has already exited. So it should always see the process has finished, and never wait.

What we probably need to fix in the k8s case is the default `signalGracePeriod` (probably via the 1-second-before-`cancelGracePeriod` computation). If the default `terminationGracePeriodSeconds` is 60, then we should consider setting `BUILDKITE_CANCEL_GRACE_PERIOD` to around 60 too. This will mean changing agent-stack-k8s. But here, I've just ripped out the bulk of #3500, leaving the flag for backwards compatibility.
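For reference, the relationship being described is just this (illustrative sketch only, not the agent's code):

```go
package sketch

import "time"

// The "1 second before cancelGracePeriod" computation: if
// BUILDKITE_CANCEL_GRACE_PERIOD were raised to ~60s to match
// terminationGracePeriodSeconds, the signal grace period would follow it
// up to ~59s.
func signalGracePeriodFor(cancelGracePeriod time.Duration) time.Duration {
	return cancelGracePeriod - 1*time.Second
}
```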
Bootstrap self-signaling

When the bootstrap receives one of the signals it handles in a special way, it registers a cancellation and removes the custom signal handler, then once the cancellation is complete, it re-signals itself so that Go's default handling deals with the signal. The expected behaviour is that it exits with status 128+signal. Go's default behaviour on `SIGQUIT` is to panic and exit with status 2. Since there seems to be no way to "exit due to `SIGQUIT` but without a panic" in Go, I've turned it into a manual "exit 131".
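The self-signalling step is along these lines (a minimal sketch, not the bootstrap's actual code):

```go
package sketch

import (
	"os"
	"os/signal"
	"syscall"
)

// exitLikeSignal removes the custom handler and re-raises sig so the
// process exits with status 128+signal, except for SIGQUIT, whose default
// Go behaviour is a goroutine dump and exit status 2, so it becomes a
// manual exit 131 instead.
func exitLikeSignal(sig syscall.Signal) {
	signal.Reset(sig)
	if sig == syscall.SIGQUIT {
		os.Exit(131) // 128 + SIGQUIT(3), without the dump
	}
	_ = syscall.Kill(syscall.Getpid(), sig)
	// If the signal somehow doesn't terminate us, fall back to a plain exit.
	os.Exit(128 + int(sig))
}
```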
Testing
- Tests have run locally (`go test ./...`). Buildkite employees may check this if the pipeline has run automatically.
- Code is formatted (`go fmt ./...`)

Disclosures / Credits
I did not use AI tools at all