Skip to content

Conversation

@jsafrane
Copy link
Member

@jsafrane jsafrane commented Mar 12, 2021

What type of PR is this?

/kind bug

What this PR does / why we need it:

Mark volume device / volume mount as uncertain after UnmountDevice / UnmountVolume operation fails, because kubelet can't be sure what's the actual state of the volume. Kubelet will then call MountDevice / SetUp to make sure the volume is fully mounted when a new pod wants to use the same volume.

Similarly, update Unmap* calls for block volumes.

Which issue(s) this PR fixes:

Fixes: #100182

Special notes for your reviewer:

  • Added e2e test for NodeUnstage errors, which exercises the UnmountDevice error path.
  • Both transient and final errors result in the mount marked as uncertain. IMO it's not safe to assume that a mount is usable even after UnmountVolume / UnmountDevice final error. Better to call SetUp / MountDevice again when the mount is needed.

Does this PR introduce a user-facing change?

Fixed starting new pods after previous pod timed out unmounting its volumes.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 12, 2021
@k8s-ci-robot
Copy link
Contributor

@jsafrane: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/test sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 12, 2021
@jsafrane jsafrane force-pushed the fix-unstage-retry branch 2 times, most recently from 3da7652 to fe7d5bd Compare March 12, 2021 13:19
@yangjunmyfm192085
Copy link
Contributor

/test pull-kubernetes-unit

@yangjunmyfm192085
Copy link
Contributor

/test pull-kubernetes-e2e-kind

@jsafrane
Copy link
Member Author

I tried to add e2e tests for the block mode, but the mock driver does not support block volumes.

@jsafrane
Copy link
Member Author

Ready for review (I tried to add e2e tests for block mode, but the mock driver does not support block)

cc @gnufied @msau42 @jingxu97 PTAL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For InvalidArgument, do we need to mark as uncertain?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an example of a "final" response, where we can be reasonably sure that e.g. provisioning is not in progress. Not sure if we can apply the same logic to Unstage.

@jingxu97
Copy link
Contributor

/approve

@jingxu97
Copy link
Contributor

PR lgtm. just want to make sure other reviewers have a chance to take a look.

@jingxu97
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 25, 2021
@jingxu97 jingxu97 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 25, 2021
jsafrane added 6 commits June 16, 2021 18:39
When UnmountDevice fails, kubelet treat the volume mount as uncertain,
because it does not know at which stage UnmountDevice failed. It may be
already partially unmonted / destroyed.

As result, MountDevice will be performer when a new Pod is started on the
node after UnmountDevice faiure.
…ere unmounted

podVolumesExist() should consider also uncertain volumes (where kubelet
does not know if a volume was fully unmounted) when checking for pod's
volumes. Added GetPossiblyMountedVolumesForPod for that.

Adding uncertain mounts to GetMountedVolumesForPod would potentially break
other callers (e.g. `verifyVolumesMountedFunc`).
To know when a volume has been fully unmounted (incl. uncertain mounts).
Change existing desiredStateOfWorldPopulator.findAndAddNewPods tests to use
a common initialization function.
desiredStateOfWorldPopulator.findAndRemoveDeletedPods() should remove
volumes from DSW when a pod is deleted on the API server and the volume is
uncertain in ASW.
@jsafrane jsafrane force-pushed the fix-unstage-retry branch from fca9dc1 to d5da730 Compare June 16, 2021 16:42
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 16, 2021
@gnufied
Copy link
Member

gnufied commented Jun 17, 2021

/retest
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2021
@k8s-ci-robot
Copy link
Contributor

@jsafrane: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-storage-slow d5da730 link /test pull-kubernetes-e2e-gce-storage-slow

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@mrunalp
Copy link
Contributor

mrunalp commented Jun 18, 2021

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jingxu97, jsafrane, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2021
@k8s-ci-robot k8s-ci-robot merged commit 4afb72a into kubernetes:master Jun 18, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.22 milestone Jun 18, 2021
@timebertt
Copy link
Contributor

Any chance we can backport this change?
We observed #100182 multiple times on v1.20 clusters.

@ankitjain28may
Copy link

Any updates, can we get to know if this is getting backported to 1.20 since we are facing a lot of such issues causing production downtimes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

NodeStage is not called after NodeUnstage error