
node outOfSync state causes deadlock and blocks pod scheduling #2999

@Monokaix

Description

What happened:
After a pod has been scheduled to a node, if the node's allocatable resources decrease for some reason (for example, the GPU device plugin reports an exception), the node is put into the outOfSync state and is no longer added to the scheduling session. No new pod can be scheduled to that node until its reported allocatable resources return to normal, even if the node still has other idle resources. This becomes a deadlock when the pod waiting to be scheduled is the one that reports the GPU resources: the node can only leave the outOfSync state after that pod is scheduled and reports the resources correctly, but the pod cannot be scheduled while the node is outOfSync.
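For context, here is a minimal sketch of the mechanism as I understand it (simplified types, not the actual Volcano scheduler-cache code): when the resources already used on a node exceed what the node currently reports as allocatable, the node is marked outOfSync and is skipped when the scheduling session is built.

```go
// Minimal sketch of the outOfSync mechanism (simplified stand-in types,
// not the real Volcano scheduler-cache code).
package main

import "fmt"

// Resource tracks only the GPU count for brevity.
type Resource struct {
	GPU int64
}

// LessEqual reports whether r fits into other.
func (r Resource) LessEqual(other Resource) bool {
	return r.GPU <= other.GPU
}

type NodePhase string

const (
	Ready    NodePhase = "Ready"
	NotReady NodePhase = "NotReady"
)

type NodeInfo struct {
	Name        string
	Allocatable Resource // resources currently reported by the kubelet / device plugin
	Used        Resource // resources of pods already bound to the node
	Phase       NodePhase
	Reason      string
}

// syncState mirrors the idea behind the outOfSync check: if pods already
// placed on the node request more than the node currently reports as
// allocatable, the node is marked NotReady/OutOfSync and is later skipped
// when the scheduling session is built.
func (ni *NodeInfo) syncState() {
	if !ni.Used.LessEqual(ni.Allocatable) {
		ni.Phase, ni.Reason = NotReady, "OutOfSync"
		return
	}
	ni.Phase, ni.Reason = Ready, ""
}

func main() {
	// A GPU pod is already running; then the device plugin is removed and
	// the node's allocatable GPUs drop to zero.
	node := &NodeInfo{Name: "node-1", Allocatable: Resource{GPU: 0}, Used: Resource{GPU: 1}}
	node.syncState()
	fmt.Println(node.Phase, node.Reason) // NotReady OutOfSync

	// The re-deployed device-plugin pod (which requests no GPU itself)
	// cannot land on this node either, because OutOfSync nodes are dropped
	// from the session entirely -- hence the deadlock described above.
}
```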
What you expected to happen:
The pod used to report GPU resources should be schedulable even when the node is in the outOfSync state.
How to reproduce it (as minimally and precisely as possible):

  1. Run a device-plugin DaemonSet that reports GPU resources
  2. Run a pod that uses GPU resources
  3. Uninstall the device-plugin DaemonSet and wait until the node's allocatable GPU resources drop to zero (a small helper for watching this is sketched after the list)
  4. Re-deploy the device-plugin DaemonSet; its pod on that node cannot be scheduled

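For step 3, a small client-go helper like the one below can be used to watch the node's allocatable GPU count drop to zero. The node name and the nvidia.com/gpu resource name are assumptions and need to be adjusted to the actual cluster.

```go
// Sketch of a helper that polls a node's allocatable GPU count while
// reproducing the issue. The node name and the "nvidia.com/gpu" resource
// name are assumptions; adjust them to your environment.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from $KUBECONFIG (falls back to in-cluster config if unset).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	nodeName := "node-1" // hypothetical: the node running the GPU pod
	for {
		node, err := clientset.CoreV1().Nodes().Get(context.TODO(), nodeName, metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		gpus := node.Status.Allocatable[v1.ResourceName("nvidia.com/gpu")]
		fmt.Printf("%s allocatable nvidia.com/gpu = %s\n", nodeName, gpus.String())
		time.Sleep(10 * time.Second)
	}
}
```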
Anything else we need to know?:

Environment:

  • Volcano Version: latest
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Labels: kind/bug
