Fix race condition in config-manager when label is unset#1541
Fix race condition in config-manager when label is unset#1541cdesiniotis merged 1 commit intoNVIDIA:mainfrom
Conversation
0e9c0ac to
1a931be
Compare
|
@jgehrcke do we do something similar in the k8s-dra-driver? If so, how do we handle the initial synchronization there? |
| m.mutex.Lock() | ||
| defer m.mutex.Unlock() | ||
| m.current = value | ||
| m.cond.Broadcast() |
There was a problem hiding this comment.
In the mig-manager we have a conditional broadcast (https://github.com/NVIDIA/mig-parted/blob/a56cd122e899778996f7488012058056100d091f/cmd/nvidia-mig-manager/main.go#L101-L103). Would that address the race that you are trying to fix here, or does the mig-manager suffer from the same possible deadlock?
There was a problem hiding this comment.
I think it will still suffer the same possible deadlock. If I understand the scenario correctly, the deadlock happens when the "first" broadcast fires before the other go routine waits and listens for the broadcast. The condition here would skip the broadcast but in the described scenario no one listens to that broadcast anyways.
When it happens the only ways to release the waiting go routine is by:
- labeling the node with some dummy config (and then deleting the label)
- killing the pod
|
Hey @elezar, did you get a chance to have a look? |
|
Gentle ping here @elezar this is happening quite a lot and requires manual intervention each time |
|
@uristernik we are reviewing. Thanks for your patience. |
fdfff08 to
403778c
Compare
|
@elezar any chance to get it merged? fix it some other way? |
|
@elezar ping 🙏 |
|
@klueska @ArangoGutierrez @cdesiniotis @RenaudWasTaken can someone please have a look on this? |
|
Can anyone please review this? |
cdesiniotis
left a comment
There was a problem hiding this comment.
@uristernik thanks for your contribution (and patience!). Would you mind squashing your commits before we merge this? Thanks.
|
/cherry-pick release-0.18 |
When the node label (nvidia.com/device-plugin.config) is not set, a race condition could cause the config-manager to hang indefinitely on startup. The issue occurred when the informer's AddFunc fired before the first Get() call, setting current="" and broadcasting. When Get() was subsequently called, it found lastRead == current (both empty strings) and waited forever, as no future events would wake it up. This fix adds an 'initialized' flag to SyncableConfig to ensure the first Get() call never waits, regardless of timing. Subsequent Get() calls still wait properly when the value hasn't changed. Signed-off-by: Uri Sternik <[email protected]>
403778c to
ab05ace
Compare
|
Thank you! @cdesiniotis done! |
|
/ok to test ab05ace |
|
🤖 Backport PR created for |
|
Thank you @cdesiniotis. Please let me know if I can help |

Summary
Fixes a race condition in the config-manager that causes it to hang indefinitely when the
nvidia.com/device-plugin.configlabel is not set on the node.Problem
When the node label is not configured, there's a timing-dependent race condition:
AddFuncfires before the firstGet()call, it setscurrent=""and broadcastsGet()is subsequently called, it findslastRead == current(both empty strings) and waits on the condition variableThis manifests as the init container hanging after printing:
Solution
Added an
initializedboolean flag toSyncableConfigto track whetherGet()has been called at least once. The firstGet()call now returns immediately with the current value, avoiding the deadlock. SubsequentGet()calls continue to wait properly when the value hasn't changed.Fixes #1540