Conversation


@fuweid fuweid commented Oct 24, 2025

We have an individual goroutine for each sandbox container. If there is any error in the handler, that goroutine puts the event into the backoff queue, so we don't need an event subscriber for podsandbox. Otherwise, there would be two goroutines cleaning up the sandbox container.
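To illustrate the pattern described above, here is a minimal sketch (not the actual containerd/CRI code; `exitEvent`, `backoffQueue`, and the `cleanup` callback are hypothetical stand-ins): each sandbox gets its own waiter goroutine, and any cleanup failure is requeued through a backoff queue, so a second event-subscriber goroutine for podsandbox would only duplicate the work.

```go
package main

import (
	"context"
	"log"
	"time"
)

// exitEvent and backoffQueue are hypothetical stand-ins for the real
// TaskExit event and the event monitor's backoff queue.
type exitEvent struct {
	ContainerID string
	ExitStatus  uint32
}

type backoffQueue chan exitEvent

// waitSandboxExit is started once per sandbox container. It blocks until
// the sandbox task exits, then performs cleanup itself; on error it puts
// the event into the backoff queue for a later retry.
func waitSandboxExit(ctx context.Context, id string, exitCh <-chan exitEvent, q backoffQueue,
	cleanup func(context.Context, exitEvent) error) {
	select {
	case ev := <-exitCh:
		if err := cleanup(ctx, ev); err != nil {
			log.Printf("cleanup of sandbox %s failed, requeueing: %v", id, err)
			q <- ev // retried later with backoff
		}
	case <-ctx.Done():
	}
}

func main() {
	q := make(backoffQueue, 16)
	exitCh := make(chan exitEvent, 1)

	go waitSandboxExit(context.Background(), "sandbox-1", exitCh, q,
		func(context.Context, exitEvent) error {
			// task.Delete + shim.Shutdown would happen here.
			return nil
		})

	exitCh <- exitEvent{ContainerID: "sandbox-1", ExitStatus: 137}
	time.Sleep(100 * time.Millisecond)
}
```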

```
>>>> From EventMonitor
  time="2025-10-23T19:30:59.626254404Z" level=debug msg="Received containerd event timestamp - 2025-10-23 19:30:59.624494674 +0000 UTC, namespace - \"k8s.io\", topic - \"/tasks/exit\""
  time="2025-10-23T19:30:59.626301912Z" level=debug msg="TaskExit event in podsandbox handler container_id:\"22e15114133e4d461ab380654fb76f3e73d3e0323989c422fa17882762979ccf\" id:\"22e15114133e4d461ab380654fb76f3e73d3e0323989c422fa17882762979ccf\" pid:203121 exit_status:137 exited_at:{seconds:1761247859 nanos:624467824}"

>>> If EventMonitor handles task exit well, it will close ttrpc
connection and then waitSandboxExit could encounter ttrpc-closed error

  time="2025-10-23T19:30:59.688031150Z" level=error msg="failed to delete task" error="ttrpc: closed" id=22e15114133e4d461ab380654fb76f3e73d3e0323989c422fa17882762979ccf
```

If both task.Delete calls fail but the shim has already been shut down, cleanupAfterDeadShim could trigger an additional task.Exit event. This would result in three events in the EventMonitor's backoff queue, which is unnecessary and could cause confusion due to duplicate events.

The worst-case scenario caused by two concurrent task.Delete calls is a shim leak. The timeline for this scenario is as follows:

| Timestamp | Component       | Action                        | Result                                                                                           |
| ------    | -----------     | --------                      | --------                                                                                         |
| T1        | EventMonitor    | Sends `task.Delete`           | Marked as Req-1                                                                                  |
| T2        | waitSandboxExit | Sends `task.Delete`           | Marked as Req-2                                                                                  |
| T3        | containerd-shim | Handles Req-2                 | Container transitions from stopped to deleted                                                    |
| T4        | containerd-shim | Handles Req-1                 | Fails - container already deleted<br>Returns error: `cannot delete a deleted process: not found` |
| T5        | EventMonitor    | Receives `not found` error    | -                                                                                                |
| T6        | EventMonitor    | Sends `shim.Shutdown` request | No-op (active container record still exists)                                                     |
| T7        | EventMonitor    | Closes ttrpc connection       | Cleans container state dir                                                                       |
| T8        | containerd-shim | Handles Req-2                 | Removes container record from memory                                                             |
| T9        | waitSandboxExit | Receives error                | Error: `ttrpc: closed`                                                                           |
| T10       | waitSandboxExit | Sends `shim.Shutdown` request | Fails (connection already closed)                                                                |
| T11       | waitSandboxExit | Closes ttrpc connection       | No-op (already closed)                                                                           |

The containerd-shim is still running because `shim.Shutdown` was sent at T6, before the container record was removed at T8, so it was a no-op. And because the container's state dir was already deleted at T7, containerd is unable to clean up the leaked shim after it restarts.

We should avoid concurrent task.Delete calls here.
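For illustration only, the sketch below shows one generic way to serialize the delete/shutdown sequence with `sync.Once` so that at most one caller actually issues it; this is not the fix in this PR (which removes the podsandbox event subscriber so only the per-sandbox goroutine performs cleanup), and the `sandboxCleaner` type is hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// sandboxCleaner ensures the delete/shutdown sequence runs at most once,
// no matter how many goroutines (event subscriber, per-sandbox waiter, ...)
// race to trigger it.
type sandboxCleaner struct {
	once sync.Once
	err  error
}

func (c *sandboxCleaner) cleanupOnce(ctx context.Context, del func(context.Context) error) error {
	c.once.Do(func() {
		c.err = del(ctx)
	})
	return c.err
}

func main() {
	c := &sandboxCleaner{}
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		calls int
	)

	for i := 0; i < 2; i++ { // EventMonitor and waitSandboxExit racing
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.cleanupOnce(context.Background(), func(context.Context) error {
				mu.Lock()
				calls++ // task.Delete + shim.Shutdown would run here
				mu.Unlock()
				return nil
			})
		}()
	}
	wg.Wait()
	fmt.Println("delete sequence executed", calls, "time(s)") // prints 1
}
```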

I also added a `shutdown` subcommand to `ctr shim` for debugging.

Fixed: #12344
Cherry-picked: #12400

(cherry picked from commit 2042e80)

@estesp estesp merged commit 477522a into containerd:release/2.1 Oct 28, 2025
145 of 150 checks passed
@fuweid fuweid deleted the weifu/backport-12400-21 branch October 28, 2025 16:03
@dmcgowan dmcgowan changed the title [release/2.1] cri/server/podsandbox: disable event subscriber [release/2.1] Disable event subscriber during task cleanup Nov 5, 2025