Skip to content

Conversation

@DrJosh9000
Copy link
Contributor

@DrJosh9000 DrJosh9000 commented Nov 10, 2025

Description

Various fixes and cleanups related to IdleMonitor. Particularly, if an agent worker stops or crashes, it should be treated as dead for the purposes of the disconnect-after-idle-timeout.

Context

https://linear.app/buildkite/issue/PS-1390/3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor

If multiple workers are spawned and then any of them either crash or are gracefully stopped by the API (or by clicking the Stop Agent button in UI), then idleMonitor.Idle() will never evaluate to true.

Changes

  • Add a new agent state in IdleMonitor, "dead", which is when an agent has exited. "Dead" is like "idle" but is terminal (an agent can't exit the "dead" state).
  • Move idleness timeout logic from the ping loop to the IdleMonitor. idleMonitor now decides when the agent should exit due to idleness.
  • Un-export idleMonitor, newIdleMonitor, and all idleMonitor methods.
  • Move int -> time.Duration conversion earlier (when producing AgentConfiguration).

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go fmt ./...)
  • Some manual testing

Disclosures / Credits

I did not use AI tools at all

@DrJosh9000 DrJosh9000 force-pushed the ps-1390-3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor branch from a5330f7 to 58bc19e Compare November 10, 2025 02:25
@DrJosh9000 DrJosh9000 requested a review from a team November 10, 2025 03:01
@DrJosh9000 DrJosh9000 force-pushed the ps-1390-3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor branch 2 times, most recently from 34cbc10 to c719cce Compare November 10, 2025 03:44
Copy link
Contributor

@zhming0 zhming0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I have two non-blocking questions.

In addition, for the sake of transparency, should we describe the bug in PR description?

If multiple workers are spawned and then any of them either crash or are gracefully stopped by the API (or by clicking the Stop Agent button in UI), then idleMonitor.Idle() will never evaluate to true.

for _, worker := range r.workers {
go func() {
errCh <- r.runWorker(ctx, worker)
defer idleMon.markDead(worker)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use idleMon.markIdle here, it will work as well? do we need a concept of "dead" in idle monitor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a distinction:

Suppose we have two workers and an idle timeout of 600s. Suppose worker 1 finished a job 300 seconds ago and is idle, and worker 2 has just died. No more jobs come in.

If we mark worker 2 as idle, then the agent will start the timeout from now and exit 600s from now (an extra 300s after all alive agents became idle). If we mark worker 2 as dead, then the agent will exit in 300s, which I think is a bit more sensible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

@DrJosh9000 DrJosh9000 force-pushed the ps-1390-3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor branch from c719cce to a84fd08 Compare November 10, 2025 04:11
@DrJosh9000 DrJosh9000 merged commit 1db2de1 into main Nov 10, 2025
1 check passed
@DrJosh9000 DrJosh9000 deleted the ps-1390-3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor branch November 10, 2025 04:51
scadu added a commit that referenced this pull request Nov 13, 2025
In #3570 idle logic has changed and `idleMon.markIdle` would get only
triggered via `AcceptAndRunJob()`.
Agents that never received a job wouldn't be marked as idle.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants