Conversation

smarterclayton (Contributor) commented May 26, 2021

A number of race conditions exist when pods are terminated early in their lifecycle, because components in the kubelet need to know "no running containers" or "containers can't be started from now on" but were relying on outdated state (they didn't know whether the pod was setting up, tearing down, or could never be started again).

Only the pod worker knows whether containers are being started for a given pod, which is required to know when a pod is "terminated" (no running containers, none coming). Move that responsibility and the podKiller function into the pod workers, and route everything that was killing the pod through the UpdatePod loop. Have pod workers remain running until the pod is ready to be deleted in etcd (PodResourcesAreReclaimed would return true). Split syncPod into three phases - setup, terminate containers, and clean up the pod - and make the transitions between those methods visible to other components. After this change, to kill a pod you tell the pod worker to `UpdatePod({UpdateType: SyncPodKill, Pod: pod})`.
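
For illustration, a component that previously went through the podKiller would now submit a kill update to the pod worker. A minimal sketch, assuming an `UpdatePodOptions` struct, a `kubetypes` package alias, and a `kl.podWorkers` field that are not spelled out in this description:

```go
// Illustrative only: UpdateType, SyncPodKill, and Pod come from the text above;
// the UpdatePodOptions struct, the kubetypes alias, and the kl.podWorkers field
// are assumptions made for readability, not verbatim code from this PR.
kl.podWorkers.UpdatePod(UpdatePodOptions{
    UpdateType: kubetypes.SyncPodKill,
    Pod:        pod,
})
```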

Several places in the kubelet were incorrect about whether they were handling terminating pods (should stop running, might still have containers) or terminated pods (no running containers). The pod worker now exposes methods that let other loops know when to set up or tear down resources based on the state of the pod. These methods remove the possibility of race conditions by ensuring that a single component is responsible for knowing each pod's allowed state, while other components simply check whether a given UID is within that window.
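
As a sketch of that delegation pattern, other loops ask the pod worker about a UID instead of tracking lifecycle state themselves. The method and function names below are illustrative placeholders, not necessarily the exact methods added by this PR:

```go
// Illustrative only: placeholder names for the kind of UID-scoped lifecycle
// queries the pod worker exposes to other kubelet components.
// Assumes: import "k8s.io/apimachinery/pkg/types"
type podLifecycleQueries interface {
    // CouldHaveRunningContainers returns true while the pod may still have
    // running containers (setup or teardown still in flight).
    CouldHaveRunningContainers(uid types.UID) bool
    // ShouldPodContentBeRemoved returns true once the pod is fully terminated
    // and its on-disk content is safe to clean up.
    ShouldPodContentBeRemoved(uid types.UID) bool
}

// A cleanup loop delegates the decision rather than re-deriving pod state:
func maybeCleanupPod(q podLifecycleQueries, uid types.UID) {
    if q.CouldHaveRunningContainers(uid) {
        return // too early: containers may still be running or starting
    }
    if q.ShouldPodContentBeRemoved(uid) {
        // safe to tear down volumes, cgroups, and the pod directory here
    }
}
```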

In addition, removing containers no longer blocks final pod deletion in the API server; container removal is handled as background cleanup.

See https://docs.google.com/document/d/1DvAmqp9CV8i4_zYvNdDWZW-V0FMk1NbZ8BAOvLRD3Ds/edit# for details

TODO:

  • Implement context cancellation on syncPod and terminatingPod when necessary
  • Add an e2e test like "pod submit and delete" with init containers (Add init container pod deletion test #103128)
  • Add an e2e test that verifies that logs remain and are accessible until the pod object is deleted (set a 15s grace period, write a log message every second)
  • Thoroughly review that no regressions in cleanup are present - setup in loops should use the right conditions, teardown should also use them
  • Improve pod sync latency by having pod worker track the latency

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a number of race conditions in the kubelet when pods are starting up or shutting down that might cause pods to take a long time to shut down.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note, kind/bug, size/XXL, kind/flake, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, needs-priority, and area/kubelet labels on May 26, 2021
@k8s-ci-robot added the area/test, sig/node, approved, sig/storage, and sig/testing labels and removed the do-not-merge/needs-sig label on May 26, 2021
smarterclayton (Contributor, Author) commented:

/priority critical-urgent

Race conditions in the kubelet mean that quickly deleted pods can leave dangling resources for significant amounts of time (although this is a large enough change that we shouldn't rush to merge it until we're confident it's correct, safe, and improves testing overall).

@k8s-ci-robot added the priority/critical-urgent label and removed the needs-priority label on May 26, 2021
@smarterclayton changed the title from "Keep pod worker running until pod is truly complete" to "Prevent Kubelet from incorrectly interpreting "not yet started" pods as "ready to terminate pods" by unifying responsibility for pod lifecycle into pod worker" on May 26, 2021
ehashman (Member) commented:

/triage accepted

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label on May 26, 2021
msau42 (Member) commented May 26, 2021

/assign @gnufied @jingxu97
for volumemanager implications

If I understand correctly, this may help resolve some of the races we've seen such as #69831, #96759, #101911

@k8s-ci-robot added the sig/apps label on May 26, 2021
bobbypage added a commit to bobbypage/kubernetes that referenced this pull request on Nov 4, 2021:
* Bump the pod status and node status update timeouts to avoid flakes
* Add a small delay after the dbus restart to ensure dbus has enough time to start up prior to sending the shutdown signal
* Change the check for a pod being terminated by graceful shutdown. Previously, the pod phase was checked to see if it was `Failed` and the pod reason string matched. This logic needed to change after the 1.22 graceful node shutdown change introduced in PR kubernetes#102344, which no longer puts pods into a `Failed` phase. Instead, the test now checks that containers are not ready, and that the pod status message and reason are set appropriately.

Signed-off-by: David Porter <[email protected]>
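
As a rough illustration of the check described in the last bullet above, the test now inspects container readiness and the pod's status reason/message rather than a `Failed` phase. The helper name and structure here are assumptions, not the test's verbatim code:

```go
// Illustrative helper: reports whether graceful node shutdown has begun
// terminating the pod. Containers should no longer be ready, and the pod
// status should carry a reason and message, rather than the pod being moved
// to the Failed phase as it was before the change in kubernetes#102344.
// Assumes: import v1 "k8s.io/api/core/v1"
func podShutdownInProgress(pod *v1.Pod) bool {
    for _, cs := range pod.Status.ContainerStatuses {
        if cs.Ready {
            return false
        }
    }
    return pod.Status.Reason != "" && pod.Status.Message != ""
}
```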