Update go-cni for CNI STATUS #11135
Conversation
Skipping CI for Draft Pull Request.
Force-pushed c756d11 to ceb3d77.
Signed-off-by: Michael Zappa <[email protected]>
Force-pushed ceb3d77 to 1f220b2.
/cherrypick release/2.0
@mikebrow: new pull request created: #11146
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cherrypick release/1.7
@MikeZappa87: only containerd org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.
@mikebrow I am not cool and cannot cherry pick
/cherrypick release/1.7
@dims: #11135 failed to apply on top of branch "release/1.7":
@MikeZappa87 doesn't apply cleanly :( need to file one by hand i guess
CNI spec 1.1.0 introduced the STATUS verb, allowing plugins to report whether they are ready to handle ADD requests. CRI-O already calls Status() at startup via the ocicni library, but stops polling once the plugin becomes ready for the first time. This means that if a CNI plugin becomes unhealthy during the node's lifetime (e.g. the plugin daemon restarts, configuration is removed and recreated, or the underlying network breaks), CRI-O continues to report NetworkReady=true to kubelet. Pods are then scheduled on the node and fail at network setup, instead of the node being marked as not-ready so the scheduler avoids it.

This is analogous to how containerd handles CNI STATUS: containerd calls netPlugin.Status() on every CRI Status request from kubelet (see containerd/containerd#11135), so it always reflects the current plugin health. CRI-O's approach of polling at startup and then stopping leaves a gap that this commit fills.

Change the CNI manager to poll in two phases:

- Phase 1 (startup): fast poll (500ms) until first success. Triggers deferred GC and notifies watchers blocking pod creation. Same as the previous behavior.
- Phase 2 (runtime): slow poll (5s) that continuously monitors plugin health. On failure, sets lastError so ReadyOrError() reports not-ready and kubelet sees NetworkReady=false. Self-heals when the plugin recovers, including re-running GC to clean up stale resources from the outage period.

Additional improvements:

- Use wait.PollUntilContextCancel instead of deprecated PollInfinite
- Replace shutdown bool with context cancellation for clean goroutine lifecycle
- Use non-blocking watcher sends to prevent deadlock when a CNI plugin flaps (ready -> down -> ready) and watcher buffers are full
- Fix TOCTOU race in AddWatcher: if the plugin is already healthy when a watcher is registered, immediately return ready instead of blocking forever waiting for a transition that already happened
- Set lastError on Shutdown so ReadyOrError() correctly reports not-ready after the manager stops
- Use the poll context for GC calls so they are cancellable on shutdown

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
CRI-O's CNI manager calls ocicni's Status() at startup to wait for the CNI plugin to become ready, but stops polling after the first success. Historically this was sufficient because Status() only checked whether a CNI config file existed on disk: once present, it would not vanish in normal operation. With the addition of the CNI STATUS verb in spec v1.1.0 and the ocicni v0.4.3 bump, Status() now also invokes the actual CNI plugin binary to ask whether it is healthy. This makes continuous polling meaningful: a plugin can report unhealthy even while its config file is still on disk (e.g. the plugin daemon restarts, configuration is regenerated, or the underlying network breaks). However, since CRI-O stops polling after initial readiness, it never detects these runtime failures and continues to report NetworkReady=true to kubelet. Pods are then scheduled on the node and fail at network setup, instead of the node being marked as not-ready so the scheduler avoids it.

This is analogous to how containerd handles CNI STATUS: containerd calls netPlugin.Status() on every CRI Status request from kubelet (see containerd/containerd#11135), so it always reflects the current plugin health. CRI-O's approach of polling at startup and then stopping leaves a gap that this commit fills.

Change the CNI manager to poll in two phases:

- Phase 1 (startup): fast poll (500ms) until first success. Triggers deferred GC and notifies watchers blocking pod creation. Same as the previous behavior.
- Phase 2 (runtime): slow poll (5s) that continuously monitors plugin health. On failure, sets lastError so ReadyOrError() reports not-ready and kubelet sees NetworkReady=false. Self-heals when the plugin recovers, including re-running GC to clean up stale resources from the outage period.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
CRI-O polls CNI Status() at startup to wait for readiness, but stops after the first success. With the CNI STATUS verb (spec v1.1.0), plugins can now report unhealthy at runtime, but CRI-O doesn't react to this verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon restart, config regeneration, or network failure), CRI-O keeps reporting NetworkReady=true and pods fail at network setup instead of the node being marked not-ready. This is analogous to containerd, which calls Status() on every CRI Status request (see containerd/containerd#11135).

Change the polling to two stages:

- Init poll (500ms): fast poll until first ready, triggers GC and notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure, self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean goroutine lifecycle, and handle edge cases in AddWatcher for shutdown and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Signed-off-by: Peter Hunt <[email protected]>
This PR updates go-cni to the latest version, which adds support for the CNI STATUS verb. To make use of the new verb, users will need to specify CNI version 1.1.0 in their CNI configuration.
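For example, a configuration could opt in by declaring version 1.1.0 at the top of its network config list. This is an illustrative config, not one shipped by this PR; the network name, bridge, and subnet are placeholders:

```json
{
  "cniVersion": "1.1.0",
  "name": "mynet",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "ipam": {
        "type": "host-local",
        "subnet": "10.88.0.0/16"
      }
    }
  ]
}
```

With cniVersion below 1.1.0, the runtime cannot issue STATUS to the plugin and falls back to the older readiness behavior.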