
Implement RestartAllContainers #134345

Merged
k8s-ci-robot merged 5 commits into kubernetes:master from yuanwang04:restart-pod
Nov 11, 2025

Conversation

@yuanwang04
Contributor

@yuanwang04 yuanwang04 commented Sep 30, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Implement RestartAllContainers KEP: kubernetes/enhancements#5532

Which issue(s) this PR is related to:

KEP: kubernetes/enhancements#5532

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Allows restarting all containers when a source container exits with a matching restart policy rule. This is an alpha feature behind the RestartAllContainersOnContainerExit feature gate.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/issues/5532
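For context, a pod using this feature might look roughly like the sketch below. This is illustrative only: the rule shape follows the existing container restart rules API, while the `RestartAllContainers` action name and its exact field placement are assumptions based on KEP-5532, not something confirmed by this PR.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: restart-all-demo
spec:
  containers:
  - name: source
    image: busybox
    restartPolicy: OnFailure
    # Hypothetical rule shape: when this container exits with code 42,
    # restart every container in the pod. Requires the alpha
    # RestartAllContainersOnContainerExit feature gate on the kubelet.
    restartPolicyRules:
    - action: RestartAllContainers
      exitCodes:
        operator: In
        values: [42]
  - name: worker
    image: busybox
```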

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 30, 2025
@k8s-ci-robot k8s-ci-robot added area/kubelet kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Sep 30, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. label Sep 30, 2025
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Apps Sep 30, 2025
@k8s-ci-robot k8s-ci-robot removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 30, 2025
@SergeyKanzhelev
Member

/cc

marking for review

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 6, 2025
@yuanwang04 yuanwang04 force-pushed the restart-pod branch 3 times, most recently from 78eefbd to 3bb33af on October 8, 2025 at 00:21
@thockin
Member

thockin commented Nov 7, 2025

/approve for api

@yuanwang04
Contributor Author

/retest-required

// discovery and inference logic.
var AllFeatures = []nodedeclaredfeatures.Feature{
inplacepodresize.Feature,
restartallcontainers.Feature,
Member

/cc @pravk03
Another customer!

Member

@tallclair tallclair left a comment


Is there any way that the pod could transition out of a Failed / Succeeded state with this feature? I'm having a hard time reasoning through the code paths to convince myself that that's not the case.

Comment thread pkg/kubelet/container/helpers.go Outdated
Comment thread pkg/kubelet/container/helpers.go Outdated
Comment thread pkg/kubelet/kuberuntime/kuberuntime_manager.go Outdated
Comment on lines +1042 to +1043
// Kill and remove containers in reverse order. Source containers (which exited and triggered
// RestartAllContainers) are removed last.
Member

Why does the order matter? Why are source containers separated out?

Member

You resolved this, but didn't answer the question?

@tallclair
Member

Before this change, was it possible for a pod to transition from the Running to Pending phase?

@yuanwang04
Contributor Author

@tallclair it's impossible for a pod to transition out of a Failed / Succeeded state. This is enforced by kubelet.generateAPIPodStatus function: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L1871-L1885

Before this change, was it possible for a pod to transition from the Running to Pending phase?

I think there is no check / validation forbidding it. It may happen if a sidecar gets restarted before the regular containers start running. However, I do think the pod should be kept in the Running state during the RestartAllContainers action. I did some refactoring and added unit tests to ensure the pod status remains Running during this restart.
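The sticky-terminal-phase invariant referenced above (enforced by kubelet's generateAPIPodStatus) can be sketched as follows. This is a simplified illustration of the invariant, not the actual kubelet code; the type and function names here are invented for the example.

```go
package main

import "fmt"

type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

// guardPhase illustrates the invariant: once a pod has reached a terminal
// phase (Failed or Succeeded), the status generator keeps reporting that
// phase no matter what the runtime now claims.
func guardPhase(oldPhase, newPhase PodPhase) PodPhase {
	if oldPhase == PodFailed || oldPhase == PodSucceeded {
		return oldPhase
	}
	return newPhase
}

func main() {
	fmt.Println(guardPhase(PodFailed, PodRunning))    // Failed: terminal phase is sticky
	fmt.Println(guardPhase(PodRunning, PodSucceeded)) // Succeeded: non-terminal phases may advance
}
```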

Comment thread pkg/kubelet/kuberuntime/kuberuntime_manager.go Outdated
Comment thread pkg/kubelet/kuberuntime/kuberuntime_manager.go
Comment thread pkg/kubelet/kuberuntime/kuberuntime_manager.go Outdated
Comment thread pkg/kubelet/kuberuntime/kuberuntime_manager.go Outdated
Comment thread pkg/kubelet/kuberuntime/kuberuntime_manager.go
Comment thread pkg/kubelet/status/generate.go
Comment thread pkg/kubelet/kuberuntime/kuberuntime_manager.go Outdated
return
}
}
if err := m.removeContainer(ctx, containerInfo.containerID.ID); err != nil {
Member

removeContainer also removes the container logs. Is that desired? Normally logs aren't removed when a container restarts.

return
}
}
if err := m.removeContainer(ctx, containerInfo.containerID.ID); err != nil {
Member

removeContainer triggers topology manager to remove the topology assignment for the container. In the case of pod-level resource managers, I think we want to make sure that this doesn't reset the pod-level topology assignment.

/cc @KevinTMtz @ffromani

Contributor

Thanks for the ping. Will read the KEP and the code and incorporate in the pod-level resource manager integration work.

Contributor

Naive question, because I'm not into this KEP's details: do we run admission again then? Removing the topology assignment MAY mean that a pod which used to run with alignment guarantees loses them on restart, which seems undesirable. If this is real, we should probably add at least an admission check: if the topology manager is enforcing, a pod with the RestartAllContainers flag enabled (at a glance this seems to surface at the pod spec level?) should not be admitted.

Member

RestartAllContainers does not seem to affect the overall pod; only its status might become Pending, which is not a concern (the CPU and memory managers' reconcileState treats a pending pod as active). As for clearing the topology manager's state, it does not appear to produce any secondary effects (the CPU/memory managers are the effective source of truth), and it seems right to clean up the container ID mappings.

As an action item, RestartAllContainers should add a test verifying that the restart flow does not affect the resource managers' functionality (the overall CPU and memory manager features, not anything specific to PodLevelResourceManagers).

I created the following document with my findings and the discussion that I had with @yuanwang04, @ndixita and @SergeyKanzhelev: PodLevelResourceManagers & RestartAllContainers.

Any work that has to be done to address any situation created between PodLevelResourceManagers and other features can be followed in #136481.
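The concern in this thread — removing one container's topology entry without disturbing sibling or pod-scoped state — can be illustrated with a small sketch. The names and data layout below are invented for illustration; the real topology manager's internal state is richer (bitmask-based NUMA assignments).

```go
package main

import "fmt"

// affinity stands in for a NUMA affinity hint.
type affinity string

// topologyState sketches per-pod, per-container hint bookkeeping:
// pod UID -> container name -> affinity.
type topologyState struct {
	hints map[string]map[string]affinity
}

// removeContainer drops one container's entry while leaving the pod's
// other container entries (and any pod-scoped bookkeeping keyed by the
// pod UID) untouched. Removing from an unknown pod is a no-op.
func (s *topologyState) removeContainer(podUID, container string) {
	if containers, ok := s.hints[podUID]; ok {
		delete(containers, container)
	}
}

func main() {
	s := &topologyState{hints: map[string]map[string]affinity{
		"pod-1": {"source": "node0", "worker": "node0"},
	}}
	s.removeContainer("pod-1", "source")
	fmt.Println(len(s.hints["pod-1"])) // 1: the sibling container's hint survives
}
```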

Contributor

It might be worth adding the reasoning behind removing in-memory topology manager state when a container is removed as a note in the RestartAllContainers KEP, @yuanwang04.

Contributor

And just to make sure we are not missing any critical details: @ffromani @swatisehgal, by any chance are you aware of why the in-memory topology manager state is retained for containers after the topology manager has finished making its affinity decisions? We could very well have cleared the memory used to store that state.

Member

Adding a reference to a comment relevant to this discussion from the Pod Level Resource Managers KEP PR: kubernetes/enhancements#5775 (comment).

@k8s-ci-robot
Contributor

@tallclair: GitHub didn't allow me to request PR reviews from the following users: KevinTMtz.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

removeContainer triggers topology manager to remove the topology assignment for the container. In the case of pod-level resource managers, I think we want to make sure that this doesn't reset the pod-level topology assignment.

/cc @KevinTMtz @ffromani

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

logger.V(3).Info("Removing container before pod restarts", "containerName", cName, "containerID", containerInfo.containerID, "pod", klog.KObj(pod))
removeContainerResult := kubecontainer.NewSyncResult(kubecontainer.RemoveContainer, cName)
result.AddSyncResult(removeContainerResult)
if containerInfo.kill {
Member

I noticed that killContainersWithSyncResult kills the containers in parallel. Would that be preferred here?
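For reference, the fan-out pattern being asked about (killing containers concurrently, in the style of killContainersWithSyncResult) looks roughly like the helper below. This is a generic sketch, not the kubelet's actual code; `killAll` and its signature are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// killAll runs the kill function for every container concurrently and
// collects per-container errors, mirroring a parallel fan-out (sketch only).
func killAll(names []string, kill func(string) error) []error {
	var wg sync.WaitGroup
	errs := make([]error, len(names))
	for i, name := range names {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			errs[i] = kill(name) // each goroutine writes only its own slot
		}(i, name)
	}
	wg.Wait()
	return errs
}

func main() {
	errs := killAll([]string{"sidecar", "main"}, func(name string) error {
		return nil // a real implementation would call the runtime here
	})
	fmt.Println(len(errs)) // 2: one result slot per container
}
```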

@SergeyKanzhelev
Member

/hold
/milestone v1.35

(as per https://kubernetes.slack.com/archives/C2C40FMNF/p1762622388020479)

Let's close on final comments and merge. @yuanwang04 please unhold at will whenever there is an lgtm

Member

@tallclair tallclair left a comment


/lgtm
/approve

Comment thread test/e2e_node/restart_all_containers_test.go Outdated
@k8s-ci-robot
Contributor

LGTM label has been added.

Details: Git tree hash: 12ea3a573e06c086e46571468ab1867d0c9a23a5

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: SergeyKanzhelev, tallclair, thockin, yuanwang04

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yuanwang04
Contributor Author

/unhold

@tallclair
Member

/lgtm

@k8s-ci-robot
Contributor

LGTM label has been added.

Details: Git tree hash: 06922a6b32b3b49288f8e3cfc497591609c98401

@k8s-ci-robot
Contributor

k8s-ci-robot commented Nov 11, 2025

@yuanwang04: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-node-e2e-containerd-alpha-features
Commit: 8305cbb (link)
Required: false
Rerun command: /test pull-kubernetes-node-e2e-containerd-alpha-features

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@yuanwang04
Contributor Author

/retest-required

@kannon92
Contributor

#135277

Looks like this is failing in the serial lanes.


Labels

api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

Status: API review completed, 1.35
Archived in project
