-
Notifications
You must be signed in to change notification settings - Fork 329
IdleMonitor-related fixes #3570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IdleMonitor-related fixes #3570
Conversation
a5330f7 to
58bc19e
Compare
34cbc10 to
c719cce
Compare
zhming0
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I have two non-blocking questions.
In addition, for the sake of transparency, should we describe the bug in PR description?
If multiple workers are spawned and then any of them either crash or are gracefully stopped by the API (or by clicking the Stop Agent button in UI), then idleMonitor.Idle() will never evaluate to true.
| for _, worker := range r.workers { | ||
| go func() { | ||
| errCh <- r.runWorker(ctx, worker) | ||
| defer idleMon.markDead(worker) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use idleMon.markIdle here, it will work as well? do we need a concept of "dead" in idle monitor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a distinction:
Suppose we have two workers and an idle timeout of 600s. Suppose worker 1 finished a job 300 seconds ago and is idle, and worker 2 has just died. No more jobs come in.
If we mark worker 2 as idle, then the agent will start the timeout from now and exit 600s from now (an extra 300s after all alive agents became idle). If we mark worker 2 as dead, then the agent will exit in 300s, which I think is a bit more sensible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense.
c719cce to
a84fd08
Compare
In #3570 idle logic has changed and `idleMon.markIdle` would get only triggered via `AcceptAndRunJob()`. Agents that never received a job wouldn't be marked as idle.
Description
Various fixes and cleanups related to IdleMonitor. Particularly, if an agent worker stops or crashes, it should be treated as dead for the purposes of the disconnect-after-idle-timeout.
Context
https://linear.app/buildkite/issue/PS-1390/3-buildkite-agent-crashingstopped-workers-breaks-idlemonitor
If multiple workers are spawned and then any of them either crash or are gracefully stopped by the API (or by clicking the Stop Agent button in UI), then idleMonitor.Idle() will never evaluate to true.
Changes
idleMonitornow decides when the agent should exit due to idleness.idleMonitor,newIdleMonitor, and allidleMonitormethods.int->time.Durationconversion earlier (when producingAgentConfiguration).Testing
go test ./...). Buildkite employees may check this if the pipeline has run automatically.go fmt ./...)Disclosures / Credits
I did not use AI tools at all