don't stop informer delivery on error #58394
Conversation
			p.handler.OnDelete(notification.oldObj)
		default:
			utilruntime.HandleError(fmt.Errorf("unrecognized notification: %#v", next))
		}
	}
This loop runs until p.nextCh is closed, which means normal graceful shutdown. So, instead of adding stopCh as a parameter to the func, it is possible to make a channel right before wait.Until() is invoked and close() it right after the for loop. Does it make sense?
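For illustration, a minimal sketch of that suggestion, assuming run() drops the stopCh parameter and the delivery loop stays as in the diff above (this is the reviewer's idea, not necessarily the final code):

```go
func (p *processorListener) run() {
	// Derive the stop channel locally instead of taking stopCh as a parameter.
	stopCh := make(chan struct{})
	wait.Until(func() {
		for next := range p.nextCh {
			// ... switch on the notification type and call p.handler, as in the diff above ...
			_ = next
		}
		// p.nextCh is closed: normal graceful shutdown, so stop wait.Until as well.
		close(stopCh)
	}, time.Second, stopCh)
}
```

A panic in a handler is recovered inside wait.Until, which then re-invokes the loop after the period; only draining a closed p.nextCh closes stopCh and ends the goroutine.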
@wryun, I think we saw a case where a panic in an informer handler did not terminate the program but left it in a zombie state? Will this help (I don't remember the details)?
@ash2k yes, it will improve matters (we were panicking in the handler; this will at least stop it stalling afterwards). Without properly understanding the implications, though, I think I'd still prefer it if the whole process shut down in our particular case, since the cache is probably now a mess and subsequent processing might be incorrect (I think?).
@wryun A panicking handler does not spoil the cache. Also, other handlers for the same event are executed (concurrently) regardless of this.
func (p *processorListener) run(stopCh <-chan struct{}) {
	// this call blocks until the channel is closed. When a panic happens during the notification
	// we will catch it, **the offending item will be skipped!**, and after a short delay (one second)
	// the next notification will be attempted. This is usually better than the alternative of never
Hmmm... if there's a more permanent problem with delivery, this will keep failing every second. Why not use some backoff here as well?
Force-pushed 39c5ee0 to 0a6bdbd
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Force-pushed 0a6bdbd to a7486f6
Comments addressed. Take a look at the new impl.
Force-pushed a7486f6 to 5f7b3fd
	// we will catch it, **the offending item will be skipped!**, and after a short delay (one second)
	// the next notification will be attempted. This is usually better than the alternative of never
	// delivering again.
	wait.ExponentialBackoff(retry.DefaultRetry, func() (bool, error) {
// ExponentialBackoff repeats a condition check with exponential backoff.
//
// It checks the condition up to Steps times, increasing the wait by multiplying
// the previous duration by Factor.
//
// If Jitter is greater than zero, a random amount of each duration is added
// (between duration and duration*(1+jitter)).
//
// If the condition never returns true, ErrWaitTimeout is returned. All other
// errors terminate immediately.
So the current code will skip 5 (the value from retry.DefaultRetry) errors with exponential delay and then stop processing forever. Not what we want, right?
Maybe in case of an error an event should be emitted? Is there a class of events to notify the cluster administrator that something is not right?
I think ideally the backoff should be capped exponential, but should reduce back to the initial delay over time if there are no errors. I personally would be happy with a fixed delay of a few seconds. Maybe a bigger fixed delay is OK, because the cluster slowdown will let the operator know that something is not working properly?
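To make that ExponentialBackoff behaviour concrete, here is a small standalone sketch; the Backoff values are made up for illustration and are not the actual retry.DefaultRetry settings:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	attempts := 0
	backoff := wait.Backoff{Duration: 10 * time.Millisecond, Factor: 2.0, Jitter: 0.1, Steps: 5}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		attempts++
		return false, nil // pretend delivery keeps failing
	})
	// After Steps attempts the condition is abandoned and ErrWaitTimeout is
	// returned, which is why a permanently failing delivery loop would stop
	// for good once the retries are exhausted.
	fmt.Println(attempts, err) // 5 attempts, then wait.ErrWaitTimeout
}
```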
That'll teach me to skip reading the godoc on the function. I could wrap it in a second loop that does a fairly long pause (a minute?) and then starts this over again?
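One way that wrapped form could look, sketched against the diff above (the one-minute outer period and the exact shutdown handling are assumptions here, not a quote of the merged code):

```go
func (p *processorListener) run() {
	// stopCh is closed only after p.nextCh has been drained and closed, i.e.
	// normal shutdown; panics and errors just make wait.Until start over.
	stopCh := make(chan struct{})
	wait.Until(func() {
		// A few quick retries via ExponentialBackoff, then wait.Until pauses
		// for the outer period before trying again.
		err := wait.ExponentialBackoff(retry.DefaultRetry, func() (bool, error) {
			for next := range p.nextCh {
				switch notification := next.(type) {
				case addNotification:
					p.handler.OnAdd(notification.newObj)
				case updateNotification:
					p.handler.OnUpdate(notification.oldObj, notification.newObj)
				case deleteNotification:
					p.handler.OnDelete(notification.oldObj)
				default:
					utilruntime.HandleError(fmt.Errorf("unrecognized notification: %#v", next))
				}
			}
			// The only way to get here is when p.nextCh is empty and closed.
			return true, nil
		})
		if err == nil {
			// Clean exit: the listener is really shutting down.
			close(stopCh)
		}
	}, 1*time.Minute, stopCh)
}
```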
👍
Force-pushed 5f7b3fd to 2fa93da
Now with more retries.
/retest
/lgtm
p.s. I've recently learned a GitHub trick: if you append ?w=1 to the URL, it ignores whitespace changes in the diff. Much easier to review PRs like this one.
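For example, a URL of the form https://github.com/kubernetes/kubernetes/pull/58394/files?w=1 shows this PR's diff with whitespace changes hidden (the concrete URL is inferred from the PR number, not quoted from the thread).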
[APPROVALNOTIFIER] This PR is APPROVED
Approval requirements bypassed by manually added approval. This pull-request has been approved by: ash2k, deads2k
No associated issue. Requirement bypassed by manually added approval.
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing
Just a generated godeps.json file. Applying approval manually. @sttts in case he gets in tomorrow in time for paperwork.
/test all [submit-queue is verifying that this PR is safe to merge]
/retest
/retest
Review the full test history for this PR.
Automatic merge from submit-queue (batch tested with PRs 58412, 56132, 58506, 58542, 58394). If you want to cherry-pick this change to another branch, please follow the instructions here.
If an informer delivery fails today, we stop delivering to it entirely. This pull request updates the code to skip that particular notification, delay, and continue delivery with the next notification.
/assign derekwaynecarr
/assign ncdc
/assign ash2k
@derekwaynecarr This would change "the controller isn't doing anything?!" to "the controller missed my (individual) resource!"