IdleMonitor-related fixes #3570

DrJosh9000 · 2025-11-10T02:23:17Z

Description

Various fixes and cleanups related to IdleMonitor. Particularly, if an agent worker stops or crashes, it should be treated as dead for the purposes of the disconnect-after-idle-timeout.

Context

https://linear.app/buildkite/issue/PS-1390/3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor

If multiple workers are spawned and then any of them either crash or are gracefully stopped by the API (or by clicking the Stop Agent button in UI), then idleMonitor.Idle() will never evaluate to true.

Changes

Add a new agent state in IdleMonitor, "dead", which is when an agent has exited. "Dead" is like "idle" but is terminal (an agent can't exit the "dead" state).
Move idleness timeout logic from the ping loop to the IdleMonitor. idleMonitor now decides when the agent should exit due to idleness.
Un-export idleMonitor, newIdleMonitor, and all idleMonitor methods.
Move int -> time.Duration conversion earlier (when producing AgentConfiguration).

Testing

Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
Code is formatted (with go fmt ./...)
Some manual testing

Disclosures / Credits

I did not use AI tools at all

zhming0

Looks good to me. I have two non-blocking questions.

In addition, for the sake of transparency, should we describe the bug in PR description?

If multiple workers are spawned and then any of them either crash or are gracefully stopped by the API (or by clicking the Stop Agent button in UI), then idleMonitor.Idle() will never evaluate to true.

agent/idle_monitor.go

zhming0 · 2025-11-10T04:04:01Z

agent/agent_pool.go

 	for _, worker := range r.workers {
 		go func() {
-			errCh <- r.runWorker(ctx, worker)
+			defer idleMon.markDead(worker)


If we use idleMon.markIdle here, it will work as well? do we need a concept of "dead" in idle monitor?

There is a distinction:

Suppose we have two workers and an idle timeout of 600s. Suppose worker 1 finished a job 300 seconds ago and is idle, and worker 2 has just died. No more jobs come in.

If we mark worker 2 as idle, then the agent will start the timeout from now and exit 600s from now (an extra 300s after all alive agents became idle). If we mark worker 2 as dead, then the agent will exit in 300s, which I think is a bit more sensible.

That makes sense.

In #3570 idle logic has changed and `idleMon.markIdle` would get only triggered via `AcceptAndRunJob()`. Agents that never received a job wouldn't be marked as idle.

DrJosh9000 force-pushed the ps-1390-3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor branch from a5330f7 to 58bc19e Compare November 10, 2025 02:25

DrJosh9000 requested a review from a team November 10, 2025 03:01

DrJosh9000 force-pushed the ps-1390-3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor branch 2 times, most recently from 34cbc10 to c719cce Compare November 10, 2025 03:44

zhming0 approved these changes Nov 10, 2025

View reviewed changes

IdleMonitor-related fixes

a84fd08

DrJosh9000 force-pushed the ps-1390-3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor branch from c719cce to a84fd08 Compare November 10, 2025 04:11

DrJosh9000 merged commit 1db2de1 into main Nov 10, 2025
1 check passed

DrJosh9000 deleted the ps-1390-3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor branch November 10, 2025 04:51

moskyb mentioned this pull request Nov 12, 2025

Bump version + changelog for v3.112.0 #3577

Merged

scadu mentioned this pull request Nov 13, 2025

Fix idle tracking for agents that never received jobs #3579

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

IdleMonitor-related fixes #3570

IdleMonitor-related fixes #3570

Uh oh!

DrJosh9000 commented Nov 10, 2025 •

edited

Loading

Uh oh!

zhming0 left a comment

Uh oh!

Uh oh!

zhming0 Nov 10, 2025

Uh oh!

DrJosh9000 Nov 10, 2025

Uh oh!

zhming0 Nov 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

IdleMonitor-related fixes #3570

IdleMonitor-related fixes #3570

Uh oh!

Conversation

DrJosh9000 commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Context

Changes

Testing

Disclosures / Credits

Uh oh!

zhming0 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

zhming0 Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

DrJosh9000 Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

zhming0 Nov 10, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

DrJosh9000 commented Nov 10, 2025 •

edited

Loading