
Update go-cni for CNI STATUS #11135

Merged
mikebrow merged 1 commit into containerd:main from MikeZappa87:feat/cnistatus on Dec 11, 2024

Conversation

@MikeZappa87 (Member) commented Dec 11, 2024

This PR updates go-cni to the latest version, which provides the CNI STATUS verb. To make use of the new verb, users will need to specify CNI version 1.1.0 in their CNI configuration.
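For reference, opting into the STATUS verb means setting cniVersion to 1.1.0 at the top of the network configuration list. A minimal illustrative conflist might look like the following; the network name, plugin, and IPAM fields are placeholders, not taken from this PR:

```json
{
  "cniVersion": "1.1.0",
  "name": "example-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "ipam": {
        "type": "host-local",
        "subnet": "10.22.0.0/16"
      }
    }
  ]
}
```

With cniVersion 1.1.0 declared, a runtime using go-cni can issue STATUS to the plugin; with older spec versions the verb is unavailable.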

@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@MikeZappa87 MikeZappa87 marked this pull request as ready for review December 11, 2024 15:42
@dosubot dosubot Bot added the dependencies Pull requests that update a dependency file label Dec 11, 2024
@MikeZappa87 MikeZappa87 requested a review from mikebrow December 11, 2024 16:39
@mikebrow (Member) left a comment

LGTM

@mikebrow mikebrow added this pull request to the merge queue Dec 11, 2024
Merged via the queue into containerd:main with commit 41bc049 Dec 11, 2024
@mikebrow mikebrow added the cherry-pick/2.0.x Change to be cherry picked to release/2.0 branch label Dec 11, 2024
@mikebrow (Member) commented

/cherrypick release/2.0

@k8s-infra-cherrypick-robot

@mikebrow: new pull request created: #11146


In response to this:

/cherrypick release/2.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@MikeZappa87 (Author) commented

/cherrypick release/1.7

@k8s-infra-cherrypick-robot

@MikeZappa87: only containerd org members may request cherry picks. If you are already part of the org, make sure to change your membership to public. Otherwise you can still do the cherry-pick manually.


In response to this:

/cherrypick release/1.7


@MikeZappa87 (Author) commented

@mikebrow I am not cool and cannot cherry pick

@dims (Member) commented Dec 12, 2024

/cherrypick release/1.7

@k8s-infra-cherrypick-robot

@dims: #11135 failed to apply on top of branch "release/1.7":

Applying: feat: update go-cni version for CNI STATUS
Using index info to reconstruct a base tree...
M	go.mod
M	go.sum
M	vendor/modules.txt
Falling back to patching base and 3-way merge...
Auto-merging vendor/modules.txt
CONFLICT (content): Merge conflict in vendor/modules.txt
Auto-merging go.sum
CONFLICT (content): Merge conflict in go.sum
Auto-merging go.mod
CONFLICT (content): Merge conflict in go.mod
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 feat: update go-cni version for CNI STATUS


In response to this:

/cherrypick release/1.7


@dims (Member) commented Dec 12, 2024

@MikeZappa87 doesn't apply cleanly :( need to file one by hand i guess

@dmcgowan dmcgowan added cherry-picked/2.0.x PR commits are cherry picked into the release/2.0 branch and removed cherry-pick/2.0.x Change to be cherry picked to release/2.0 branch labels Dec 12, 2024
tsorya added a commit to tsorya/cri-o that referenced this pull request Mar 28, 2026
CNI spec 1.1.0 introduced the STATUS verb, allowing plugins to report
whether they are ready to handle ADD requests. CRI-O already calls
Status() at startup via the ocicni library, but stops polling once the
plugin becomes ready for the first time. This means that if a CNI
plugin becomes unhealthy during the node's lifetime (e.g. the plugin
daemon restarts, configuration is removed and recreated, or the
underlying network breaks), CRI-O continues to report NetworkReady=true
to kubelet. Pods are then scheduled on the node and fail at network
setup, instead of the node being marked as not-ready so the scheduler
avoids it.

This is analogous to how containerd handles CNI STATUS: containerd
calls netPlugin.Status() on every CRI Status request from kubelet
(see containerd/containerd#11135), so it always reflects the current
plugin health. CRI-O's approach of polling at startup and then stopping
leaves a gap that this commit fills.

Change the CNI manager to poll in two phases:
- Phase 1 (startup): fast poll (500ms) until first success. Triggers
  deferred GC and notifies watchers blocking pod creation. Same as
  the previous behavior.
- Phase 2 (runtime): slow poll (5s) that continuously monitors plugin
  health. On failure, sets lastError so ReadyOrError() reports
  not-ready and kubelet sees NetworkReady=false. Self-heals when the
  plugin recovers, including re-running GC to clean up stale resources
  from the outage period.

Additional improvements:
- Use wait.PollUntilContextCancel instead of deprecated PollInfinite
- Replace shutdown bool with context cancellation for clean goroutine
  lifecycle
- Use non-blocking watcher sends to prevent deadlock when a CNI plugin
  flaps (ready -> down -> ready) and watcher buffers are full
- Fix TOCTOU race in AddWatcher: if the plugin is already healthy when
  a watcher is registered, immediately return ready instead of blocking
  forever waiting for a transition that already happened
- Set lastError on Shutdown so ReadyOrError() correctly reports
  not-ready after the manager stops
- Use the poll context for GC calls so they are cancellable on shutdown

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Mar 28, 2026
tsorya added a commit to tsorya/cri-o that referenced this pull request Mar 28, 2026
CRI-O's CNI manager calls ocicni's Status() at startup to wait for the
CNI plugin to become ready, but stops polling after the first success.
Historically this was sufficient because Status() only checked whether
a CNI config file existed on disk — once present, it would not vanish
in normal operation.

With the addition of the CNI STATUS verb in spec v1.1.0 and the ocicni
v0.4.3 bump, Status() now also invokes the actual CNI plugin binary to
ask whether it is healthy. This makes continuous polling meaningful:
a plugin can report unhealthy even while its config file is still on
disk (e.g. the plugin daemon restarts, configuration is regenerated,
or the underlying network breaks). However, since CRI-O stops polling
after initial readiness, it never detects these runtime failures and
continues to report NetworkReady=true to kubelet. Pods are then
scheduled on the node and fail at network setup, instead of the node
being marked as not-ready so the scheduler avoids it.

This is analogous to how containerd handles CNI STATUS: containerd
calls netPlugin.Status() on every CRI Status request from kubelet
(see containerd/containerd#11135), so it always reflects the current
plugin health. CRI-O's approach of polling at startup and then stopping
leaves a gap that this commit fills.

Change the CNI manager to poll in two phases:
- Phase 1 (startup): fast poll (500ms) until first success. Triggers
  deferred GC and notifies watchers blocking pod creation. Same as
  the previous behavior.
- Phase 2 (runtime): slow poll (5s) that continuously monitors plugin
  health. On failure, sets lastError so ReadyOrError() reports
  not-ready and kubelet sees NetworkReady=false. Self-heals when the
  plugin recovers, including re-running GC to clean up stale resources
  from the outage period.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Mar 29, 2026
CRI-O's CNI manager calls ocicni's Status() at startup to wait for the
CNI plugin to become ready, but stops polling after the first success.
Historically this was sufficient because Status() only checked whether
a CNI config file existed on disk — once present, it would not vanish
in normal operation.

With the addition of the CNI STATUS verb in spec v1.1.0 and the ocicni
v0.4.3 bump, Status() now also invokes the actual CNI plugin binary to
ask whether it is healthy. This makes continuous polling meaningful:
a plugin can report unhealthy even while its config file is still on
disk (e.g. the plugin daemon restarts, configuration is regenerated,
or the underlying network breaks). However, since CRI-O stops polling
after initial readiness, it never detects these runtime failures and
continues to report NetworkReady=true to kubelet. Pods are then
scheduled on the node and fail at network setup, instead of the node
being marked as not-ready so the scheduler avoids it.

This is analogous to how containerd handles CNI STATUS: containerd
calls netPlugin.Status() on every CRI Status request from kubelet
(see containerd/containerd#11135), so it always reflects the current
plugin health. CRI-O's approach of polling at startup and then stopping
leaves a gap that this commit fills.

Change the CNI manager to poll in two phases:
- Phase 1 (startup): fast poll (500ms) until first success. Triggers
  deferred GC and notifies watchers blocking pod creation. Same as
  the previous behavior.
- Phase 2 (runtime): slow poll (5s) that continuously monitors plugin
  health. On failure, sets lastError so ReadyOrError() reports
  not-ready and kubelet sees NetworkReady=false. Self-heals when the
  plugin recovers, including re-running GC to clean up stale resources
  from the outage period.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 7, 2026
CRI-O's CNI manager calls ocicni's Status() at startup to wait for the
CNI plugin to become ready, but stops polling after the first success.
Historically this was sufficient because Status() only checked whether
a CNI config file existed on disk — once present, it would not vanish
in normal operation.

With the addition of the CNI STATUS verb in spec v1.1.0 and the ocicni
v0.4.3 bump, Status() now also invokes the actual CNI plugin binary to
ask whether it is healthy. This makes continuous polling meaningful:
a plugin can report unhealthy even while its config file is still on
disk (e.g. the plugin daemon restarts, configuration is regenerated,
or the underlying network breaks). However, since CRI-O stops polling
after initial readiness, it never detects these runtime failures and
continues to report NetworkReady=true to kubelet. Pods are then
scheduled on the node and fail at network setup, instead of the node
being marked as not-ready so the scheduler avoids it.

This is analogous to how containerd handles CNI STATUS: containerd
calls netPlugin.Status() on every CRI Status request from kubelet
(see containerd/containerd#11135), so it always reflects the current
plugin health. CRI-O's approach of polling at startup and then stopping
leaves a gap that this commit fills.

Change the CNI manager to poll in two phases:
- Phase 1 (startup): fast poll (500ms) until first success. Triggers
  deferred GC and notifies watchers blocking pod creation. Same as
  the previous behavior.
- Phase 2 (runtime): slow poll (5s) that continuously monitors plugin
  health. On failure, sets lastError so ReadyOrError() reports
  not-ready and kubelet sees NetworkReady=false. Self-heals when the
  plugin recovers, including re-running GC to clean up stale resources
  from the outage period.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 7, 2026
CRI-O's CNI manager calls ocicni's Status() at startup to wait for the
CNI plugin to become ready, but stops polling after the first success.
Historically this was sufficient because Status() only checked whether
a CNI config file existed on disk — once present, it would not vanish
in normal operation.

With the addition of the CNI STATUS verb in spec v1.1.0 and the ocicni
v0.4.3 bump, Status() now also invokes the actual CNI plugin binary to
ask whether it is healthy. This makes continuous polling meaningful:
a plugin can report unhealthy even while its config file is still on
disk (e.g. the plugin daemon restarts, configuration is regenerated,
or the underlying network breaks). However, since CRI-O stops polling
after initial readiness, it never detects these runtime failures and
continues to report NetworkReady=true to kubelet. Pods are then
scheduled on the node and fail at network setup, instead of the node
being marked as not-ready so the scheduler avoids it.

This is analogous to how containerd handles CNI STATUS: containerd
calls netPlugin.Status() on every CRI Status request from kubelet
(see containerd/containerd#11135), so it always reflects the current
plugin health. CRI-O's approach of polling at startup and then stopping
leaves a gap that this commit fills.

Change the CNI manager to poll in two stages:
- Init poll (500ms): fast poll until first success. Triggers deferred
  GC and notifies watchers blocking pod creation. Same as the previous
  behavior.
- Monitor poll (5s): continuous slow poll that monitors plugin health
  at runtime. On failure, sets lastError so ReadyOrError() reports
  not-ready and kubelet sees NetworkReady=false. Self-heals when the
  plugin recovers, including re-running GC to clean up stale resources
  from the outage period.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 7, 2026
CRI-O's CNI manager calls ocicni's Status() at startup to wait for the
CNI plugin to become ready, but stops polling after the first success.
Historically this was sufficient because Status() only checked whether
a CNI config file existed on disk — once present, it would not vanish
in normal operation.

With the addition of the CNI STATUS verb in spec v1.1.0 and the ocicni
v0.4.3 bump, Status() now also invokes the actual CNI plugin binary to
ask whether it is healthy. This makes continuous polling meaningful:
a plugin can report unhealthy even while its config file is still on
disk (e.g. the plugin daemon restarts, configuration is regenerated,
or the underlying network breaks). However, since CRI-O stops polling
after initial readiness, it never detects these runtime failures and
continues to report NetworkReady=true to kubelet. Pods are then
scheduled on the node and fail at network setup, instead of the node
being marked as not-ready so the scheduler avoids it.

This is analogous to how containerd handles CNI STATUS: containerd
calls netPlugin.Status() on every CRI Status request from kubelet
(see containerd/containerd#11135), so it always reflects the current
plugin health. CRI-O's approach of polling at startup and then stopping
leaves a gap that this commit fills.

Change the CNI manager to poll in two stages:
- Init poll (500ms): fast poll until first success. Triggers deferred
  GC and notifies watchers blocking pod creation. Same as the previous
  behavior.
- Monitor poll (5s): continuous slow poll that monitors plugin health
  at runtime. On failure, sets lastError so ReadyOrError() reports
  not-ready and kubelet sees NetworkReady=false. Self-heals when the
  plugin recovers, including re-running GC to clean up stale resources
  from the outage period.

Replace the shutdown bool with context.WithCancel for goroutine
lifecycle management. Shutdown() now cancels the context to stop both
poll loops and sets lastError to a sentinel error so ReadyOrError()
reports not-ready after shutdown. AddWatcher() checks for the shutdown
state to avoid blocking callers that race with shutdown.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 7, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O never asks again. If a
plugin becomes unhealthy (e.g. daemon restart, config regeneration, or
network failure), CRI-O keeps reporting NetworkReady=true and pods fail
at network setup instead of the node being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 7, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 9, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 9, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 13, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 13, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 13, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 14, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 15, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 16, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 16, 2026
CRI-O polls CNI Status() at startup to wait for readiness, but stops
after the first success. With the CNI STATUS verb (spec v1.1.0), plugins
can now report unhealthy at runtime, but CRI-O doesn't react to this
verb after initial readiness. If a plugin becomes unhealthy (e.g. daemon
restart, config regeneration, or network failure), CRI-O keeps reporting
NetworkReady=true and pods fail at network setup instead of the node
being marked not-ready.

This is analogous to containerd, which calls Status() on every CRI
Status request (see containerd/containerd#11135).

Change the polling to two stages:
- Init poll (500ms): fast poll until first ready, triggers GC and
  notifies watchers. Same as before.
- Monitor poll (5s): continuous health check. Sets not-ready on failure,
  self-heals on recovery including GC for stale resources.

Also replace the shutdown bool with context cancellation for clean
goroutine lifecycle, and handle edge cases in AddWatcher for shutdown
and already-ready states.

Signed-off-by: Igal Tsoiref <[email protected]>
Made-with: Cursor
tsorya added a commit to tsorya/cri-o that referenced this pull request Apr 17, 2026
(same commit message as above)
hswong3i pushed a commit to alvistack/cri-o-cri-o that referenced this pull request Apr 23, 2026
(same commit message as above)
haircommander pushed a commit to cri-o/cri-o that referenced this pull request Apr 24, 2026
(same commit message as above, additionally signed off by Peter Hunt)
hswong3i pushed a commit to alvistack/cri-o-cri-o that referenced this pull request Apr 27, 2026
(same commit message as above)
Labels

cherry-picked/2.0.x — PR commits are cherry picked into the release/2.0 branch
dependencies — Pull requests that update a dependency file
size/L

7 participants