Description
What happened:
After a pod is scheduled to a node, if the node's allocatable resources decrease for some reason (for example, the GPU device plugin reports an exception), the node is put into the OutOfSync state and is no longer added to the scheduling session. No new pod can be scheduled to that node until the allocatable resources it reports return to normal, even if the node still has other idle resources. The pod that reports GPU resources (the device-plugin pod) can only be scheduled after the node leaves the OutOfSync state, but leaving OutOfSync requires that very pod to be scheduled and to report resources correctly. This causes a deadlock.
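To make the cycle concrete, here is a minimal, self-contained Go sketch of the behavior as I understand it. It is not Volcano's actual code: the `Node`, `refresh`, and `openSession` names are made up for illustration. The point is that a node whose used resources exceed its reported allocatable gets marked OutOfSync and dropped from the session, so the device-plugin pod that would restore allocatable can never land on it.

```go
package main

import "fmt"

// Resource is a simplified view of node resources, keyed by name
// (e.g. "cpu", "nvidia.com/gpu"), in arbitrary integer units.
type Resource map[string]int64

// Node mimics a scheduler-cache view of a node: what the kubelet currently
// reports as allocatable versus what the scheduler has already placed.
type Node struct {
	Name        string
	Allocatable Resource
	Used        Resource
	OutOfSync   bool
}

// refresh marks the node OutOfSync when used exceeds allocatable,
// e.g. after the device plugin dies and GPU allocatable drops to zero
// while a GPU pod is still running there.
func (n *Node) refresh() {
	n.OutOfSync = false
	for name, used := range n.Used {
		if used > n.Allocatable[name] {
			n.OutOfSync = true
		}
	}
}

// openSession returns the nodes the scheduler will consider in this cycle.
// Skipping OutOfSync nodes entirely is what creates the deadlock: the
// device-plugin pod that would restore GPU allocatable can never be bound.
func openSession(nodes []*Node) []*Node {
	var eligible []*Node
	for _, n := range nodes {
		n.refresh()
		if n.OutOfSync {
			continue // node is dropped from the session; nothing can be scheduled to it
		}
		eligible = append(eligible, n)
	}
	return eligible
}

func main() {
	node := &Node{
		Name:        "gpu-node-1",
		Allocatable: Resource{"cpu": 8000, "nvidia.com/gpu": 0}, // plugin gone, GPU allocatable is zero
		Used:        Resource{"cpu": 2000, "nvidia.com/gpu": 1}, // old GPU pod is still accounted for
	}
	eligible := openSession([]*Node{node})
	fmt.Printf("eligible nodes: %d (OutOfSync=%v)\n", len(eligible), node.OutOfSync)
	// Prints "eligible nodes: 0 (OutOfSync=true)": the re-deployed device-plugin
	// pod cannot be scheduled here, so allocatable never recovers.
}
```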
What you expected to happen:
The pod that reports GPU resources should be schedulable even while the node is in the OutOfSync state.
How to reproduce it (as minimally and precisely as possible):
- Run a device-plugin DaemonSet to report GPU resources
- Run a pod that uses GPU resources
- Uninstall the device-plugin DaemonSet and wait until the node's allocatable GPU resources drop to zero
- Re-deploy the device-plugin DaemonSet; its pod on that node cannot be scheduled (a minimal client-go version of step 2 is sketched after this list)
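For step 2, this is a hedged client-go sketch of the GPU workload pod. The scheduler name `volcano`, the image, the namespace, and the resource name `nvidia.com/gpu` (NVIDIA device plugin) are assumptions; substitute whatever device plugin and configuration you actually use.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// A workload pod that holds one GPU and is scheduled by Volcano.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-workload", Namespace: "default"},
		Spec: corev1.PodSpec{
			SchedulerName: "volcano",
			Containers: []corev1.Container{{
				Name:  "cuda",
				Image: "nvidia/cuda:12.2.0-base-ubuntu22.04",
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
	created, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created pod:", created.Name)
	// Then delete the device-plugin DaemonSet, wait for the node's GPU
	// allocatable to drop to zero, re-apply the DaemonSet, and observe
	// that its pod on this node stays Pending.
}
```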
Anything else we need to know?:
Environment:
- Volcano Version: latest
- Kubernetes version (use `kubectl version`):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. `uname -a`):
- Install tools:
- Others: