set leader election client and renew timeout #65094
Conversation
/assign @mikedanese @timothysc /cc @kubernetes/api-reviewers
Force-pushed from 6751dbd to 2dc249b
@@ -201,7 +201,19 @@ func (le *LeaderElector) renew() {
	stop := make(chan struct{})
	wait.Until(func() {
		err := wait.Poll(le.config.RetryPeriod, le.config.RenewDeadline, func() (bool, error) {
Poll already retries and times out. What does this do differently?
Indeed, wait.Poll checks the timeout before running ConditionFunc. If ConditionFunc takes too long, it cannot quit quickly.
Ahhh, gross. Ok, let me think about this.
Force-pushed from 2dc249b to 256f6bf
leaderElectionClient := clientset.NewForConfigOrDie(restclient.AddUserAgent(kubeconfig, "leader-election"))
// shallow copy, do not modify the kubeconfig.Timeout.
config := *kubeconfig
config.Timeout = s.GenericComponent.LeaderElection.RenewDeadline.Duration
Would it be sufficient to set this to RetryPeriod? I think then we can get away without the other timeout.
No, this timeout is set on the HTTP client; it is there to prevent blocking forever.
I understand what it is used for. If we retry every RetryPeriod anyway, why would we want a client to block for longer than RetryPeriod?
In general, a timeout shorter than RetryPeriod works well. But if the network is unstable, the renew can take longer than that, and during that period this client would lose leadership.
This is just my own thought; I would not mind hearing a different opinion.
return le.tryAcquireOrRenew(), nil
var succeed bool
done := make(chan struct{})
go func() {
These goroutines are going to leak. This concerns me.
After setting the timeout, the goroutine is guaranteed to exit within at most twice the RenewDeadline.
@@ -289,7 +290,9 @@ func createClients(config componentconfig.ClientConnectionConfiguration, masterO
	return nil, nil, nil, err
}

leaderElectionClient, err := clientset.NewForConfig(restclient.AddUserAgent(kubeConfig, "leader-election"))
restConfig := *restclient.AddUserAgent(kubeConfig, "leader-election")
restConfig.Timeout = timeout
You set Timeout on the return value of restclient.AddUserAgent here, but on kubeconfig in cmd/kube-controller-manager/app/options/options.go. Make them consistent.
ok
done := make(chan struct{})
go func() {
	defer close(done)
	succeed = le.tryAcquireOrRenew()
How about making done = make(chan bool, 1), sending the result onto it, and removing succeed?
That can also achieve the same effect, but I think it is less direct.
I would argue this is more direct.
And it avoids potentially racing on the succeed var. Access is synchronized on the done channel right now, but my first reaction was "it's read and written to concurrently".
Makes sense.
you persuaded me
select {
case <-time.After(le.config.LeaseDuration - le.config.RetryPeriod):
	return false, fmt.Errorf("timeout trying acquire or renew")
Add the timeout duration to the message.
ok
}()

select {
case <-time.After(le.config.LeaseDuration - le.config.RetryPeriod):
Why not re-use timeoutCtx.Done()?
Makes sense.
Force-pushed from 256f6bf to 1ea4560
/retest
select {
case <-timeoutCtx.Done():
	return false, fmt.Errorf("tryAcquireOrRenew timed out after %s", le.config.RenewDeadline)
Hmm, now that this is based on the context, add timeoutCtx.Err() to the message. The parent ctx passed to renew could've been cancelled.
ok
done := make(chan struct{})
go func() {
	defer close(done)
	succeed = le.tryAcquireOrRenew()
Force-pushed from 1ea4560 to d344648
Force-pushed from 7b6080c to ff847bd
/lgtm
/approve
/hold
Force-pushed from ff847bd to 90b287c
@awly @mikedanese squashed the last two commits
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: awly, hzxuzhonghu, mikedanese. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/hold cancel
/retest
Review the full test history for this PR.
Automatic merge from submit-queue (batch tested with PRs 65094, 65533, 63522, 65694, 65702). If you want to cherry-pick this change to another branch, please follow the instructions here.
What this PR does / why we need it:
set leader-election client timeout
set timeout for tryAcquireOrRenew
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #65090 #65257
Special notes for your reviewer:
Release note: