We’ve encountered a recurring issue where the init container in the nvidia-device-plugin (and similarly in the mps-control-daemon) becomes stuck in the Running state and never completes. This prevents the main container from starting.
The last log line observed before the hang is (for the device-plugin):
```
nvidia-device-plugin-init W1127 22:37:28.837842 7 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
nvidia-device-plugin-init I1127 22:37:28.838748 7 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label
```
The last log line observed before the hang is (for the mps-control-daemon):
```
mps-control-daemon-mounts I1127 18:28:06.816180 7 main.go:81] "NVIDIA MPS Control Daemon" version=<
mps-control-daemon-mounts e0a461e1
mps-control-daemon-mounts commit: e0a461e1e7ad1d239d4708c954f08c3038e2654a
mps-control-daemon-mounts >
mps-control-daemon-mounts W1127 18:28:06.818457 7 mount_helper_common.go:34] Warning: mount cleanup skipped because path does not exist: /mps/shm
mps-control-daemon-init W1127 18:28:14.816030 20 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
mps-control-daemon-init I1127 18:28:14.816535 20 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label
```
Restarting the affected pod immediately resolves the issue.
Observed Behavior
- The init container remains stuck indefinitely.
- The main container does not start.
- After restarting the pod, it proceeds normally.
Environment Details
- Component: nvidia-device-plugin (and mps-control-daemon)
- Kubernetes version: v1.32.9 (EKS)
- Driver / Toolkit version: v0.17.3
- Config override: Not overriding config per node.
- Deployed via helm chart
- General config:

```yaml
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        mps:
          renameByDefault: false
          resources:
            - name: nvidia.com/gpu
              replicas: 5
```
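For context, with `replicas: 5` under `sharing.mps`, each physical GPU on a node is advertised to the scheduler as five `nvidia.com/gpu` resources, and workloads request a share the usual way. A hypothetical pod fragment (the pod name and image are illustrative, not from this deployment):

```yaml
# Hypothetical example: requests one of the five MPS shares
# advertised per physical GPU by the config above.
apiVersion: v1
kind: Pod
metadata:
  name: mps-example
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
```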
Frequency
- Occurs on approximately 1–3% of nodes.
- Appears to be non-deterministic, potentially a race condition.
Code Reference
Code reference for where the hang occurs (k8s-device-plugin/cmd/config-manager/main.go, lines 229 to 255 at commit 624b771):
```go
func start(c *cli.Context, f *Flags) error {
	kubeconfig, err := clientcmd.BuildConfigFromFlags("", f.Kubeconfig)
	if err != nil {
		return fmt.Errorf("error building kubernetes clientcmd config: %s", err)
	}

	clientset, err := kubernetes.NewForConfig(kubeconfig)
	if err != nil {
		return fmt.Errorf("error building kubernetes clientset from config: %s", err)
	}

	config := NewSyncableConfig(f)

	stop := continuouslySyncConfigChanges(clientset, config, f)
	defer close(stop)

	for {
		klog.Infof("Waiting for change to '%s' label", f.NodeLabel)
		config := config.Get()
		klog.Infof("Label change detected: %s=%s", f.NodeLabel, config)
		err := updateConfig(config, f)
		if f.Oneshot || err != nil {
			return err
		}
	}
}
```
From an initial review, the hang appears to occur inside `config.Get()`, while waiting to acquire the lock: the last log line emitted is the "Waiting for change to ... label" message, and the subsequent "Label change detected" line is never printed.
We considered setting ONESHOT=false, but since the hang occurs during lock acquisition, that change likely would not resolve the issue.