We’ve encountered a recurring issue where the init container in the nvidia-device-plugin (and similarly in the mps-control-daemon) becomes stuck in the Running state and never completes. This prevents the main container from starting.
The last log line observed before the hang is (for the device-plugin):
```
nvidia-device-plugin-init W1127 22:37:28.837842 7 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
nvidia-device-plugin-init I1127 22:37:28.838748 7 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label
```
The last log line observed before the hang is (for the mps-control-daemon):
```
mps-control-daemon-mounts I1127 18:28:06.816180 7 main.go:81] "NVIDIA MPS Control Daemon" version=<
mps-control-daemon-mounts e0a461e1
mps-control-daemon-mounts commit: e0a461e1e7ad1d239d4708c954f08c3038e2654a
mps-control-daemon-mounts >
mps-control-daemon-mounts W1127 18:28:06.818457 7 mount_helper_common.go:34] Warning: mount cleanup skipped because path does not exist: /mps/shm
mps-control-daemon-init W1127 18:28:14.816030 20 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
mps-control-daemon-init I1127 18:28:14.816535 20 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label
```
Restarting the affected pod immediately resolves the issue.
Observed Behavior
- The init container remains stuck indefinitely.
- The main container does not start.
- After restarting the pod, it proceeds normally.
Environment Details
- Component: nvidia-device-plugin (and mps-control-daemon)
- Kubernetes version: v1.32.9 (EKS)
- Driver / Toolkit version: v0.17.3
- Config override: Not overriding config per node.
- Deployed via helm chart
- General config:

```yaml
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        mps:
          renameByDefault: false
          resources:
            - name: nvidia.com/gpu
              replicas: 5
```
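For context, with `replicas: 5` under `sharing.mps`, each physical GPU on a node is advertised to the scheduler as five `nvidia.com/gpu` resources, and workloads request a share the usual way. A hypothetical pod fragment (the pod name and image are illustrative, not from this deployment):

```yaml
# Hypothetical example: requests one of the five MPS shares
# advertised per physical GPU by the config above.
apiVersion: v1
kind: Pod
metadata:
  name: mps-example
spec:
  containers:
    - name: cuda-app
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1
```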
Frequency
- Occurs on approximately 1–3% of nodes.
- Appears to be non-deterministic, potentially a race condition.
Code Reference
Code reference for where the hang occurs (k8s-device-plugin/cmd/config-manager/main.go, lines 229 to 255 at commit 624b771):
```go
func start(c *cli.Context, f *Flags) error {
	kubeconfig, err := clientcmd.BuildConfigFromFlags("", f.Kubeconfig)
	if err != nil {
		return fmt.Errorf("error building kubernetes clientcmd config: %s", err)
	}

	clientset, err := kubernetes.NewForConfig(kubeconfig)
	if err != nil {
		return fmt.Errorf("error building kubernetes clientset from config: %s", err)
	}

	config := NewSyncableConfig(f)

	stop := continuouslySyncConfigChanges(clientset, config, f)
	defer close(stop)

	for {
		klog.Infof("Waiting for change to '%s' label", f.NodeLabel)
		config := config.Get()
		klog.Infof("Label change detected: %s=%s", f.NodeLabel, config)
		err := updateConfig(config, f)
		if f.Oneshot || err != nil {
			return err
		}
	}
}
```
From an initial review, the hang appears to occur inside `config.Get()`, while waiting to acquire the lock: the last log line emitted is the "Waiting for change to ... label" message, and the subsequent "Label change detected" line is never printed.
We considered setting ONESHOT=false, but since the hang occurs during lock acquisition, that change likely would not resolve the issue.