Conversation

smarterclayton (Contributor) commented May 26, 2021

A number of race conditions exist when pods are terminated early in their lifecycle, because components in the kubelet need to know "no running containers" or "containers can't be started from now on" but were relying on outdated state (they didn't know whether the pod was setting up, tearing down, or could never be started again).

Only the pod worker knows whether containers are being started for a given pod, which is required to know when a pod is "terminated" (no running containers, none coming). Move that responsibility and the podKiller function into the pod workers, and route everything that was killing the pod through the UpdatePod loop. Have pod workers remain running until the pod is ready to be deleted in etcd (PodResourcesAreReclaimed would return true). Split syncPod into three phases - setup, terminate containers, and clean up the pod - and make the transitions between those methods visible to other components. After this change, to kill a pod you tell the pod worker to `UpdatePod({UpdateType: SyncPodKill, Pod: pod})`.
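
For illustration, a component that previously went through the podKiller would now submit a kill update to the pod worker. A minimal sketch, assuming an `UpdatePodOptions` struct, a `kubetypes` package alias, and a `kl.podWorkers` field that are not spelled out in this description:

```go
// Illustrative only: UpdateType, SyncPodKill, and Pod come from the text above;
// the UpdatePodOptions struct, the kubetypes alias, and the kl.podWorkers field
// are assumptions made for readability, not verbatim code from this PR.
kl.podWorkers.UpdatePod(UpdatePodOptions{
    UpdateType: kubetypes.SyncPodKill,
    Pod:        pod,
})
```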

Several places in the kubelet were incorrect about whether they were handling terminating pods (should stop running, might still have containers) or terminated pods (no running containers). The pod worker now exposes methods that let other loops know when to set up or tear down resources based on the state of the pod. These methods remove the possibility of race conditions by ensuring that a single component is responsible for knowing each pod's allowed state, while other components simply check whether a given UID is within that window.
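
As a sketch of that delegation pattern, other loops ask the pod worker about a UID instead of tracking lifecycle state themselves. The method and function names below are illustrative placeholders, not necessarily the exact methods added by this PR:

```go
// Illustrative only: placeholder names for the kind of UID-scoped lifecycle
// queries the pod worker exposes to other kubelet components.
// Assumes: import "k8s.io/apimachinery/pkg/types"
type podLifecycleQueries interface {
    // CouldHaveRunningContainers returns true while the pod may still have
    // running containers (setup or teardown still in flight).
    CouldHaveRunningContainers(uid types.UID) bool
    // ShouldPodContentBeRemoved returns true once the pod is fully terminated
    // and its on-disk content is safe to clean up.
    ShouldPodContentBeRemoved(uid types.UID) bool
}

// A cleanup loop delegates the decision rather than re-deriving pod state:
func maybeCleanupPod(q podLifecycleQueries, uid types.UID) {
    if q.CouldHaveRunningContainers(uid) {
        return // too early: containers may still be running or starting
    }
    if q.ShouldPodContentBeRemoved(uid) {
        // safe to tear down volumes, cgroups, and the pod directory here
    }
}
```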

In addition, removing containers no longer blocks final pod deletion in the API server; container removal is handled as background cleanup.

See https://docs.google.com/document/d/1DvAmqp9CV8i4_zYvNdDWZW-V0FMk1NbZ8BAOvLRD3Ds/edit# for details

TODO:

  • Implement context cancellation on syncPod and terminatingPod when necessary
  • Add an e2e test like "pod submit and delete" with init containers (Add init container pod deletion test #103128)
  • Add an e2e test that verifies that logs remain and are accessible until the pod object is deleted (set a 15s grace period, write a log message every second)
  • Thoroughly review that no regressions in cleanup are present - setup in loops should use the right conditions, teardown should also use them
  • Improve pod sync latency by having pod worker track the latency

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a number of race conditions in the kubelet when pods are starting up or shutting down that might cause pods to take a long time to shut down.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note, kind/bug, size/XXL, kind/flake, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, needs-priority, and area/kubelet labels on May 26, 2021
@k8s-ci-robot added the area/test, sig/node, approved, sig/storage, and sig/testing labels and removed the do-not-merge/needs-sig label on May 26, 2021
smarterclayton (Contributor, Author) commented:

/priority critical-urgent

Race conditions in the kubelet mean that quickly deleted pods can leave dangling resources for significant amounts of time (although this is a large enough change that we shouldn't rush to merge it until we're confident it's correct, safe, and improves testing overall).

@k8s-ci-robot added the priority/critical-urgent label and removed the needs-priority label on May 26, 2021
@smarterclayton changed the title from "Keep pod worker running until pod is truly complete" to "Prevent Kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate pods" by unifying responsibility for pod lifecycle into pod worker" on May 26, 2021
ehashman (Member) commented:

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on May 26, 2021
msau42 (Member) commented May 26, 2021

/assign @gnufied @jingxu97
for volumemanager implications

If I understand correctly, this may help resolve some of the races we've seen such as #69831, #96759, #101911

@k8s-ci-robot added the sig/apps label on May 26, 2021
bobbypage added a commit to bobbypage/kubernetes that referenced this pull request on Nov 4, 2021:
* Bump the pod status and node status update timeouts to avoid flakes
* Add a small delay after the dbus restart to ensure dbus has enough time to start up prior to sending the shutdown signal
* Change the check for a pod being terminated by graceful shutdown. Previously, the pod phase was checked to see if it was `Failed` and the pod reason string matched. This logic needed to change after the 1.22 graceful node shutdown change introduced in PR kubernetes#102344, which no longer puts pods into a `Failed` phase. Instead, the test now checks that containers are not ready, and that the pod status message and reason are set appropriately.

Signed-off-by: David Porter <[email protected]>
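
As a rough illustration of the check described in the last bullet above, the test now inspects container readiness and the pod's status reason/message rather than a `Failed` phase. The helper name and structure here are assumptions, not the test's verbatim code:

```go
// Illustrative helper: reports whether graceful node shutdown has begun
// terminating the pod. Containers should no longer be ready, and the pod
// status should carry a reason and message, rather than the pod being moved
// to the Failed phase as it was before the change in kubernetes#102344.
// Assumes: import v1 "k8s.io/api/core/v1"
func podShutdownInProgress(pod *v1.Pod) bool {
    for _, cs := range pod.Status.ContainerStatuses {
        if cs.Ready {
            return false
        }
    }
    return pod.Status.Reason != "" && pod.Status.Message != ""
}
```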