Helm: Adding tolerations to block deployment to drained nodes#40137
Helm: Adding tolerations to block deployment to drained nodes#40137joestringer merged 1 commit intocilium:mainfrom
Conversation
|
Commit 33dafed does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
|
(Marking draft while you iterate and get CI passing, feel free to use the "Ready for review" button at the end of the page when you'd like feedback) |
gandro
left a comment
There was a problem hiding this comment.
Thanks for the PR. This is still a potentially breaking change, but it does fix a bug.
Two points:
- What's the reason for moving the default tolerations away from the values into the chart? To be able to add the agent taints?
- We need to document this as a minor breaking change in the upgrade nodes
I was adding into default tolerations. Then i saw previous PR discussions (#28856) . I updated mine based on that. If this is merged. I can work on documentation in next PR |
Thanks for providing this context, this is super useful. The provided tolerations look good to me!
Please add the documentation to this PR, it's how we usually do it for Helm changes. It's easy to get lost or forgotten otherwise and it also helps with more readable release notes. |
|
@gandro Can you point me which documents should i update ? |
Sure! There's a section in the upgrade notes here: cilium/Documentation/operations/upgrade.rst Lines 291 to 294 in 260af0e Unfortunately, since we just branched off v1.18, you will the first one to add upgrade notes to v1.19. I recommend creating a |
|
Thanks @gandro , |
|
Feature freeze for v1.18 unfortunately already happened, so technically we shouldn't include this in v1.18. Considering that this is potentially a bug fix (but also a breaking change, so we clearly shouldn't backport this to v1.17 or v1.18 after release), I would be okay with backporting this. @joestringer any opinions? |
|
For 1.17 , We saw this issue in our deployments and we can see the issue in our daily bases deployment tests. We had similar patch in our deployments. I dont see this as a feature . This is kind of a bug for me. |
joestringer
left a comment
There was a problem hiding this comment.
The question to me for this PR is not whether this helps to solve problems in your environment, but whether it can break things in other users' environments. Where there is such a risk, we avoid making such changes in later bugfix releases because users expect Cilium a given release series to become more stable over time, not less stable. As such I would not vote to backport to v1.17 release series.
For us to detect a problem and mitigate an issue with this PR, the PR would need to be merged and backported to v1.18 before 2025-07-14 to be included in the next RC. Hopefully in the following two weeks a few users can test out the second RC and report any potential issues for us to then fix in v1.18.0. This isn't a lot of time, but given that it helps with some environments it seems like it may be preferable to users in the target environment facing the same issue for another entire release cycle.
install/kubernetes/cilium/templates/cilium-operator/deployment.yaml
Outdated
Show resolved
Hide resolved
| } | ||
| ] | ||
| }, | ||
| "items": {}, |
There was a problem hiding this comment.
Why does the schema need to be changed? 🤔
There was a problem hiding this comment.
Without understanding too deeply, the idea of the schema is to detect problems in the user configuration and report them to the user before they configure the setting rather than silently accepting values that may be invalid or be ignored.
Changing the schema to remove that validation seems to suggest that the format of the configuration will change, and users won't have helm linting their configuration to help them ensure the configuration is correct.
There was a problem hiding this comment.
This is due to parameters are set to null and code generation is confused. once i add these parameters back it will be recoved .
There was a problem hiding this comment.
@joestringer Should i also add implement a test case to validate upgrade if user have older version
There was a problem hiding this comment.
Could you describe the test case you have in mind?
|
Thanks for taking care of that @giorio94 🙏 |
|
Thanks @giorio94 |
In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]>
In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]>
In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]>
In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]>
In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]>
In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]>
In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]>
In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]>
|
|
[ upstream commit 92469ca ] In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]> Signed-off-by: Nicolas Busseneau <[email protected]>
[ upstream commit 92469ca ] In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]> Signed-off-by: Nicolas Busseneau <[email protected]>
[ upstream commit 92469ca ] In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]> Signed-off-
Cilium-operator pod automatically rescheduled onto drained mode. This can cause to block some kubernetes upgrades.
Default tolerations for cilium operator are updated with these values defined below.
These are test results
Fixes: #28549