Fix race condition in config-manager when label is unset #1541

Merged

cdesiniotis merged 1 commit into NVIDIA:main from uristernik:fix-config-manager-race-condition on Jan 7, 2026

Conversation

@uristernik (Contributor) commented Nov 30, 2025

Summary

Fixes a race condition in the config-manager that causes it to hang indefinitely when the nvidia.com/device-plugin.config label is not set on the node.

Problem

When the node label is not configured, there's a timing-dependent race condition:

  1. If the Kubernetes informer's AddFunc fires before the first Get() call, it sets current="" and broadcasts
  2. When Get() is subsequently called, it finds lastRead == current (both empty strings) and waits on the condition variable
  3. No future events wake it up since the label remains unset, causing a permanent hang

This manifests as the init container hanging after printing:

Waiting for change to 'nvidia.com/device-plugin.config' label
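
For concreteness, a minimal sketch of the pattern involved (field names follow the description above; the actual cmd/config-manager/main.go may differ in detail):

```go
// Illustrative sketch only; assumes the field names used in the
// description above (current, lastRead). The real code may differ.
package config

import "sync"

type SyncableConfig struct {
	mutex    sync.Mutex
	cond     *sync.Cond
	current  string
	lastRead string
}

func NewSyncableConfig() *SyncableConfig {
	m := &SyncableConfig{}
	m.cond = sync.NewCond(&m.mutex)
	return m
}

// Set is invoked from the informer callbacks (AddFunc/UpdateFunc).
func (m *SyncableConfig) Set(value string) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	m.current = value
	m.cond.Broadcast() // a no-op if no goroutine is waiting yet
}

// Get blocks until the value differs from the last one returned.
// Race: if AddFunc already ran with an unset label, then on the very
// first call lastRead == current == "", so Wait() blocks, and nothing
// ever broadcasts again while the label stays unset.
func (m *SyncableConfig) Get() string {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	if m.lastRead == m.current {
		m.cond.Wait()
	}
	m.lastRead = m.current
	return m.lastRead
}
```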

Solution

Added an initialized boolean flag to SyncableConfig to track whether Get() has been called at least once. The first Get() call now returns immediately with the current value, avoiding the deadlock. Subsequent Get() calls continue to wait properly when the value hasn't changed.
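
In sketch form (illustrative; see the commit for the actual diff):

```go
type SyncableConfig struct {
	mutex       sync.Mutex
	cond        *sync.Cond
	current     string
	lastRead    string
	initialized bool // true once Get() has returned at least once
}

// The first call returns the current value (possibly "") without
// waiting, so a broadcast that fired before anyone was waiting can no
// longer strand the caller. Subsequent calls block until Set()
// publishes a new value.
func (m *SyncableConfig) Get() string {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	if m.initialized && m.lastRead == m.current {
		m.cond.Wait()
	}
	m.initialized = true
	m.lastRead = m.current
	return m.lastRead
}
```

The caller treats whatever the first Get() returns, empty or not, as the initial config, and only blocks for genuine changes afterwards.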

Fixes #1540

copy-pr-bot (Bot) commented Nov 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@uristernik force-pushed the fix-config-manager-race-condition branch from 0e9c0ac to 1a931be on November 30, 2025 at 13:46
@elezar (Member) commented Dec 1, 2025

@jgehrcke do we do something similar in the k8s-dra-driver? If so, how do we handle the initial synchronization there?

```go
m.mutex.Lock()
defer m.mutex.Unlock()
m.current = value
m.cond.Broadcast()
```
@elezar (Member) commented on the lines above:

In the mig-manager we have a conditional broadcast (https://github.com/NVIDIA/mig-parted/blob/a56cd122e899778996f7488012058056100d091f/cmd/nvidia-mig-manager/main.go#L101-L103). Would that address the race that you are trying to fix here, or does the mig-manager suffer from the same possible deadlock?

@uristernik (Contributor, Author) replied:

I think it would still suffer from the same deadlock. If I understand the scenario correctly, the deadlock happens when the "first" broadcast fires before the other goroutine starts waiting on the condition variable. The conditional check would skip the broadcast, but in the described scenario no one is listening for that broadcast anyway (see the sketch below).

When it happens, the only ways to release the waiting goroutine are:

  1. labeling the node with some dummy config (and then deleting the label)
  2. killing the pod
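
To illustrate, reusing the SyncableConfig sketch from the PR description (not the actual mig-parted source):

```go
// Set with the mig-manager-style conditional broadcast.
func (m *SyncableConfig) Set(value string) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	m.current = value
	if m.lastRead != m.current {
		m.cond.Broadcast()
	}
	// Unset-label scenario: value == "" and lastRead == "", so the
	// broadcast is skipped. But a later Get() still sees
	// lastRead == current and waits forever: the same deadlock.
}
```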

@uristernik (Contributor, Author) commented:

Hey @elezar, did you get a chance to have a look?

@uristernik (Contributor, Author) commented:

Gentle ping here, @elezar. This is happening quite a lot and requires manual intervention each time.

@elezar (Member) commented Dec 5, 2025

@uristernik we are reviewing. Thanks for your patience.

(Outdated comment thread on cmd/config-manager/main.go)
@uristernik force-pushed the fix-config-manager-race-condition branch from fdfff08 to 403778c on December 7, 2025 at 08:13
@jojimt left a comment:

LGTM

@uristernik (Contributor, Author) commented:

Quick update here @elezar @jgehrcke: I am running a forked version with this commit and we are not seeing the issue reproduce. At any given time we are running between 200 and 550 GPU nodes.
[screenshot]

@uristernik (Contributor, Author) commented:

@elezar any chance of getting this merged, or fixed in some other way?
We are running the forked version, but we don't want to maintain the fork forever 🙏

@uristernik (Contributor, Author) commented:

@elezar ping 🙏

@uristernik (Contributor, Author) commented:

@klueska @ArangoGutierrez @cdesiniotis @RenaudWasTaken can someone please have a look at this?

@uristernik (Contributor, Author) commented:

Can anyone please review this?

@cdesiniotis (Contributor) left a comment:

@uristernik thanks for your contribution (and patience!). Would you mind squashing your commits before we merge this? Thanks.

@cdesiniotis (Contributor) commented:

/cherry-pick release-0.18

@cdesiniotis added the bug label (Issue/PR to expose/discuss/fix a bug) on Jan 6, 2026
@cdesiniotis added this to the v0.18.2 milestone on Jan 6, 2026
The squashed commit message reads:

When the node label (nvidia.com/device-plugin.config) is not set, a race
condition could cause the config-manager to hang indefinitely on startup.

The issue occurred when the informer's AddFunc fired before the first Get()
call, setting current="" and broadcasting. When Get() was subsequently called,
it found lastRead == current (both empty strings) and waited forever, as no
future events would wake it up.

This fix adds an 'initialized' flag to SyncableConfig to ensure the first
Get() call never waits, regardless of timing. Subsequent Get() calls still
wait properly when the value hasn't changed.

Signed-off-by: Uri Sternik <[email protected]>
@uristernik force-pushed the fix-config-manager-race-condition branch from 403778c to ab05ace on January 7, 2026 at 07:01
@uristernik (Contributor, Author) commented:

Thank you! @cdesiniotis done!

@cdesiniotis (Contributor) commented:

/ok to test ab05ace

@cdesiniotis merged commit 80d3a65 into NVIDIA:main on Jan 7, 2026
11 checks passed
github-actions (Bot) commented Jan 7, 2026

🤖 Backport PR created for release-0.18: #1577

@uristernik (Contributor, Author) commented:

Thank you @cdesiniotis.
Following the link that @elezar sent in #1541 (comment), I think this fix is also needed in mig-manager; do you want me to open a pull request for that?
And can you please have a look at #1481? Today I have to patch the Helm chart manually to properly deploy the MPS daemonset.

Please let me know if I can help


Labels

bug (Issue/PR to expose/discuss/fix a bug), cherry-pick/release-0.18

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Init container hangs waiting for device-plugin.config label

4 participants