Helm: Adding tolerations to block deployment to drained nodes by parlakisik · Pull Request #40137 · cilium/cilium

parlakisik · 2025-06-20T00:16:58Z

Cilium-operator pod automatically rescheduled onto drained mode. This can cause to block some kubernetes upgrades.

Default tolerations for cilium operator are updated with these values defined below.

       - key: "node-role.kubernetes.io/control-plane"
          operator: Exists
        - key: "node-role.kubernetes.io/master" #deprecated
          operator: Exists
        - key: "node.kubernetes.io/not-ready"
          operator: Exists

These are test results

// Testing template output 
#helm template cilium ./cilium --debug
. . . . . 
      hostNetwork: true
      restartPolicy: Always
      priorityClassName: system-cluster-critical
      serviceAccountName: "cilium-operator"
      automountServiceAccountToken: true
      # In HA mode, cilium-operator pods must not be scheduled on the same
      # node as they will clash with each other.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                io.cilium/app: operator
            topologyKey: kubernetes.io/hostname
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
        - key: "node-role.kubernetes.io/control-plane"
          operator: Exists
        - key: "node-role.kubernetes.io/master" #deprecated
          operator: Exists
        - key: "node.kubernetes.io/not-ready"
          operator: Exists
        - key: node.cilium.io/agent-not-ready
          operator: Exists
      volumes:
        # To read the configuration from the config map
      - name: cilium-config-path
        configMap:
          name: cilium-config
....

// Testing installation and verification 
#helm install cilium ./install/kubernetes/cilium
#kubectl describe deployment cilium-operator
Name:                   cilium-operator
Namespace:              default
CreationTimestamp:      Thu, 19 Jun 2025 16:58:38 -0700
Labels:                 app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=cilium-operator
                        app.kubernetes.io/part-of=cilium
                        io.cilium/app=operator
                        name=cilium-operator
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: cilium
                        meta.helm.sh/release-namespace: default
Selector:               io.cilium/app=operator,name=cilium-operator
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  50% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/name=cilium-operator
                    app.kubernetes.io/part-of=cilium
                    io.cilium/app=operator
                    name=cilium-operator
  Annotations:      prometheus.io/port: 9963
                    prometheus.io/scrape: true
  Service Account:  cilium-operator
  Containers:
   cilium-operator:
    Image:      quay.io/cilium/operator-generic-ci:latest
    Port:       9963/TCP
    Host Port:  9963/TCP
    Command:
      cilium-operator-generic
    Args:
      --config-dir=/tmp/cilium/config-map
      --debug=$(CILIUM_DEBUG)
    Liveness:   http-get http://127.0.0.1:9234/healthz delay=60s timeout=3s period=10s #success=1 #failure=3
    Readiness:  http-get http://127.0.0.1:9234/healthz delay=0s timeout=3s period=5s #success=1 #failure=5
    Environment:
      K8S_NODE_NAME:          (v1:spec.nodeName)
      CILIUM_K8S_NAMESPACE:   (v1:metadata.namespace)
      CILIUM_DEBUG:          <set to the key 'debug' of config map 'cilium-config'>  Optional: true
    Mounts:
      /tmp/cilium/config-map from cilium-config-path (ro)
  Volumes:
   cilium-config-path:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               cilium-config
    Optional:           false
  Priority Class Name:  system-cluster-critical
  Node-Selectors:       kubernetes.io/os=linux
  Tolerations:          node-role.kubernetes.io/control-plane op=Exists
                        node-role.kubernetes.io/master op=Exists
                        node.cilium.io/agent-not-ready op=Exists
                        node.kubernetes.io/not-ready op=Exists
Conditions:

Fixes: #28549

Change the default taints that Cilium tolerates to avoid deploying to a drained node

maintainer-s-little-helper · 2025-06-20T15:38:34Z

Commit 33dafed does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

install/kubernetes/cilium/values.yaml.tmpl

joestringer · 2025-06-20T17:27:58Z

(Marking draft while you iterate and get CI passing, feel free to use the "Ready for review" button at the end of the page when you'd like feedback)

gandro

Thanks for the PR. This is still a potentially breaking change, but it does fix a bug.

Two points:

What's the reason for moving the default tolerations away from the values into the chart? To be able to add the agent taints?
We need to document this as a minor breaking change in the upgrade nodes

parlakisik · 2025-06-23T15:48:29Z

Thanks for the PR. This is still a potentially breaking change, but it does fix a bug.

Two points:

What's the reason for moving the default tolerations away from the values into the chart? To be able to add the agent taints?

We need to document this as a minor breaking change in the upgrade nodes

I was adding into default tolerations. Then i saw previous PR discussions (#28856) . I updated mine based on that.
So user can set own tolerations , if there is no toleration set. it is updated by default ones.

If this is merged. I can work on documentation in next PR

gandro · 2025-06-24T08:18:56Z

I was adding into default tolerations. Then i saw previous PR discussions (#28856) . I updated mine based on that. So user can set own tolerations , if there is no toleration set. it is updated by default ones.

Thanks for providing this context, this is super useful. The provided tolerations look good to me!

If this is merged. I can work on documentation in next PR

Please add the documentation to this PR, it's how we usually do it for Helm changes. It's easy to get lost or forgotten otherwise and it also helps with more readable release notes.

parlakisik · 2025-07-02T16:42:21Z

@gandro Can you point me which documents should i update ?
I don't want to miss anything.

gandro · 2025-07-03T07:31:15Z

@gandro Can you point me which documents should i update ? I don't want to miss anything.

Sure! There's a section in the upgrade notes here:

cilium/Documentation/operations/upgrade.rst

Lines 291 to 294 in 260af0e

    
           .. _1.18_upgrade_notes: 
        
           1.18 Upgrade Notes 
        
           ------------------

Unfortunately, since we just branched off v1.18, you will the first one to add upgrade notes to v1.19. I recommend creating a 1.19 Upgrade Notes heading above the v1.18 one, and then add a bullet point under the Helm Options subheading saying that the default value for operator.tolerations have changed and explaining in a short sentence what the implications are (see the existing v1.18 Helm upgrade notes as an example).

parlakisik · 2025-07-03T09:01:44Z

Thanks @gandro ,
I will update document today.
Can we tag this one to backport 1.18 too ?

gandro · 2025-07-03T09:54:54Z

Feature freeze for v1.18 unfortunately already happened, so technically we shouldn't include this in v1.18.

Considering that this is potentially a bug fix (but also a breaking change, so we clearly shouldn't backport this to v1.17 or v1.18 after release), I would be okay with backporting this. @joestringer any opinions?

parlakisik · 2025-07-03T15:42:42Z

For 1.17 , We saw this issue in our deployments and we can see the issue in our daily bases deployment tests. We had similar patch in our deployments. I dont see this as a feature . This is kind of a bug for me.

joestringer

The question to me for this PR is not whether this helps to solve problems in your environment, but whether it can break things in other users' environments. Where there is such a risk, we avoid making such changes in later bugfix releases because users expect Cilium a given release series to become more stable over time, not less stable. As such I would not vote to backport to v1.17 release series.

For us to detect a problem and mitigate an issue with this PR, the PR would need to be merged and backported to v1.18 before 2025-07-14 to be included in the next RC. Hopefully in the following two weeks a few users can test out the second RC and report any potential issues for us to then fix in v1.18.0. This isn't a lot of time, but given that it helps with some environments it seems like it may be preferable to users in the target environment facing the same issue for another entire release cycle.

install/kubernetes/cilium/templates/cilium-operator/deployment.yaml

joestringer · 2025-07-03T18:44:17Z

install/kubernetes/cilium/values.schema.json

-              }
-            ]
-          },
+          "items": {},


Why does the schema need to be changed? 🤔

Without understanding too deeply, the idea of the schema is to detect problems in the user configuration and report them to the user before they configure the setting rather than silently accepting values that may be invalid or be ignored.

Changing the schema to remove that validation seems to suggest that the format of the configuration will change, and users won't have helm linting their configuration to help them ensure the configuration is correct.

This is due to parameters are set to null and code generation is confused. once i add these parameters back it will be recoved .

@joestringer Should i also add implement a test case to validate upgrade if user have older version

Could you describe the test case you have in mind?

joestringer · 2025-07-14T16:49:57Z

Thanks for taking care of that @giorio94 🙏

parlakisik · 2025-07-14T16:55:13Z

Thanks @giorio94

In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]>

giorio94 · 2025-07-17T09:39:30Z

~~Dropped the needs-backport label, as superseded by #40531~~ Wrong PR, sorry.

[ upstream commit 92469ca ] In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]> Signed-off-by: Nicolas Busseneau <[email protected]>

[ upstream commit 92469ca ] In #40137 we've added fixed list of tolerations. However, it was not handling agentNotReadyTaintKey toleration, that can be easily changed by users. Signed-off-by: Marcel Zieba <[email protected]> Signed-off-

parlakisik requested review from a team as code owners June 20, 2025 00:16

parlakisik requested review from gandro and joamaki June 20, 2025 00:16

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 20, 2025

github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jun 20, 2025

maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 20, 2025

parlakisik force-pushed the node_drain branch from 33dafed to 61d1373 Compare June 20, 2025 16:26

maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jun 20, 2025

joestringer reviewed Jun 20, 2025

View reviewed changes

install/kubernetes/cilium/values.yaml.tmpl Outdated Show resolved Hide resolved

joestringer marked this pull request as draft June 20, 2025 17:27

joestringer added the release-note/minor This PR changes functionality that users may find relevant to operating Cilium. label Jun 20, 2025

maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 20, 2025

parlakisik force-pushed the node_drain branch from 61d1373 to df3c658 Compare June 20, 2025 20:13

joestringer added the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jun 20, 2025

parlakisik marked this pull request as ready for review June 20, 2025 21:33

gandro reviewed Jun 23, 2025

View reviewed changes

joamaki approved these changes Jun 24, 2025

View reviewed changes

joestringer removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Jul 1, 2025

joestringer reviewed Jul 3, 2025

View reviewed changes

parlakisik force-pushed the node_drain branch from df3c658 to b54aec9 Compare July 6, 2025 21:15

marseel mentioned this pull request Jul 15, 2025

operator: fix tolerations for operator #40531

Merged

Conversation

parlakisik commented Jun 20, 2025 • edited by joestringer Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maintainer-s-little-helper bot commented Jun 20, 2025

Uh oh!

Uh oh!

joestringer commented Jun 20, 2025

Uh oh!

gandro left a comment

Choose a reason for hiding this comment

Uh oh!

parlakisik commented Jun 23, 2025

Uh oh!

gandro commented Jun 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

parlakisik commented Jul 2, 2025

Uh oh!

gandro commented Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

parlakisik commented Jul 3, 2025

Uh oh!

gandro commented Jul 3, 2025

Uh oh!

parlakisik commented Jul 3, 2025

Uh oh!

joestringer left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

joestringer Jul 3, 2025

Choose a reason for hiding this comment

Uh oh!

joestringer Jul 3, 2025

Choose a reason for hiding this comment

Uh oh!

parlakisik Jul 3, 2025

Choose a reason for hiding this comment

Uh oh!

parlakisik Jul 3, 2025

Choose a reason for hiding this comment

Uh oh!

joestringer Jul 7, 2025

Choose a reason for hiding this comment

Uh oh!

joestringer commented Jul 14, 2025

Uh oh!

parlakisik commented Jul 14, 2025

Uh oh!

giorio94 commented Jul 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

parlakisik commented Jun 20, 2025 •

edited by joestringer

Loading

gandro commented Jun 24, 2025 •

edited

Loading

gandro commented Jul 3, 2025 •

edited

Loading

giorio94 commented Jul 17, 2025 •

edited

Loading