Config for timeouts and delays in probes#11458
Conversation
adleong
left a comment
There was a problem hiding this comment.
Thanks @jan-kantert!
The default probe timeouts are 1s, so that's the current behavior. Changing the defaults from 1s to 30s and 5s respectively feels like a big change. I'm not sure how we would determine if these defaults are better.
What do you think about leaving the defaults as the current value of 1s? At least this change will allow users to configure these timeouts explicitly without changing the default behavior.
Sure we can do that. However, those small timeouts cause instability on clusters with high load. I personally would still at least increase the default for the livenessProbe (as Kubernetes recommends to have a longer timeout for your livenessProbe than your readinessProbe). This timeout also has been introduced more or less silently with kubernetes 1.20 (https://kubernetes.io/blog/2020/12/08/kubernetes-1-20-release-announcement/#exec-probe-timeout-handling). |
alpeb
left a comment
There was a problem hiding this comment.
Thanks @jan-kantert 👍
Please complete the Proxy struct in ./pkg/charts/linkerd2/values.go with these new entries for the linkerd install CLI to be able to render them.
Also, please run ./bin/helm-docs for the comments added into the values.yaml file to be picked up in the chart's README.md file, and go test ./... -update to update the unit test golden files.
|
Hi @jan-kantert, just checking in to see if you have time to make the above changes? |
|
Hey @jan-kantert, just to give you an update: I pushed the requested changes to your branch. I did not change anything in the implementation (but I did modify the file to resolve a merge conflict) -- most of my work has been to fix the tests and get this reviewable. Your effort is super appreciated! I think due to the nature of the issue itself (that you have described in #11453) we felt that it would be better to get this fix in soon so we can get it tested asap. The contribution is solely yours :) |
charts/partials/templates/_proxy.tpl
Outdated
| initialDelaySeconds: {{.Values.proxy.livenessProbe.initialDelaySeconds | default 10}} | ||
| timeoutSeconds: {{.Values.proxy.livenessProbe.timeoutSeconds | default 1}} |
There was a problem hiding this comment.
I would get rid of the default statements, as we already have defaults in the values.yaml file, and to be consistent with the rest of the file (see startupProbe).
There was a problem hiding this comment.
yeah I did not want to touch the impl but I agree.
| defaultInboundPolicy: "all-unauthenticated" | ||
| # -- LivenessProbe timeout and delay configuration | ||
| livenessProbe: | ||
| initialDelaySeconds: 10 |
There was a problem hiding this comment.
So the default here used to be zero? Setting the new default to 10 is going to significantly delay the startup time for pods. How about leaving the default as it was? Pathological cases would still have an escape hatch with this new setting.
There was a problem hiding this comment.
it was hardcoded in to 10, see the _proxy.tpl change. The default did not change. The only defaults we have changed are timeouts since we've added them in.
|
@jan-kantert we've noticed your DCO check wasn't properly signed. We can take this across the finish line if you'd like, but we still need to have all of the commits signed to pass the check :) |
|
@jan-kantert Hey! We'd like to take this, if you can confirm that you agree with the whole DCO thing. A comment on this issue is fine. 🙂 |
|
I agree to the DCO! Sorry that I did not get to it before Christmas. Thanks for taking care! |
Add timeout config for readinessProbe and livenessProbes
Add defaults for timeouts and delays
Changed to previous default
changed to previous default
Signed-off-by: Matei David <[email protected]>
Signed-off-by: Matei David <[email protected]>
08d7a34 to
c39b6db
Compare
alpeb
left a comment
There was a problem hiding this comment.
I've rebased and pushed some minor changes. Good to go IMO 👍
adleong
left a comment
There was a problem hiding this comment.
This adds a way to configure probe initialDelay and timeout. Do we also anticipate needing to configure probe period? Should we just add that as well while we're here? Worth a thought but not a blocker.
This release addresses some issues in the destination service that could cause it to behave unexpectedly when processing updates. * Fixed a race condition in the destination service that could cause panics under very specific conditions ([#12022]; fixes [#12010]) * Changed how updates to a `Server` selector are handled in the destination service. When a `Server` that marks a port as opaque no longer selects a resource, the resource's opaqueness will reverted to default settings ([#12031]; fixes [#11995]) * Introduced Helm configuration values for liveness and readiness probe timeouts and delays ([#11458]; fixes [#11453]) (thanks @jan-kantert!) [#12010]: #12010 [#12022]: #12022 [#11995]: #11995 [#12031]: #12031 [#11453]: #11453 [#11458]: #11458 Signed-off-by: Matei David <[email protected]>
Proposed change to implement #11453