Skip to content

Conversation

@zhming0
Copy link
Contributor

@zhming0 zhming0 commented Dec 11, 2025

Description

Starting #3306 and subsequently worsen by #3398, we have created a double retry mechanism that will make k8s agent to spend 1200 seconds on retry when it's unable to connect to the coordinating agent container.

This PR remove double retry mechanism, centralize the retry logic to socket.Connect.

Context

PS-1491

Testing

  • Tests have run locally (with go test ./...). Buildkite employees may check this if the pipeline has run automatically.
  • Code is formatted (with go tool gofumpt -extra -w .)

Disclosures / Credits

Human

@zhming0 zhming0 requested review from a team December 11, 2025 04:49
return stdoutRedactor, logger
}

func (e *Executor) kubernetesSetup(ctx context.Context, environ *env.Environment, k8sAgentSocket *kubernetes.Client) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dead code. To be removed in #3629


const retryInterval = time.Second
maxAttempts := max(int(timeout.Seconds()/retryInterval.Seconds()), 1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified a bit. It doesn't matter if maxAttempts is fixed at 120 - if the context times out after 120 seconds then the last few attempts won't happen (roko is context aware).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do but the issue is that I can't predict how high people will set the timeout to be 😬 . I set a 3600 max attempt as safety net, wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3600 sounds reasonable. The other path would be roko.TryForever, if a user wants to configure the option arbitrarily poorly we could allow that.

Copy link
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from my latest comment, LGTM

@zhming0 zhming0 enabled auto-merge December 11, 2025 22:52
@zhming0 zhming0 merged commit 23e358f into main Dec 11, 2025
2 checks passed
@zhming0 zhming0 deleted the ming/ps-1491 branch December 11, 2025 23:00
@zhming0 zhming0 mentioned this pull request Dec 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants