Conversation


uristernik commented Nov 30, 2025

Summary

Fixes a race condition in the config-manager that causes it to hang indefinitely when the nvidia.com/device-plugin.config label is not set on the node.

Problem

When the node label is not configured, there's a timing-dependent race condition:

  1. If the Kubernetes informer's AddFunc fires before the first Get() call, it sets current="" and broadcasts
  2. When Get() is subsequently called, it finds lastRead == current (both empty strings) and waits on the condition variable
  3. No future events wake it up since the label remains unset, causing a permanent hang

This manifests as the init container hanging after printing:

Waiting for change to 'nvidia.com/device-plugin.config' label
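
The following is a minimal, standalone sketch of that interleaving (illustration only: the type and method names mirror the description above rather than the exact upstream code). When Set("") runs before the first Get(), the broadcast fires with nobody waiting, and since the label never changes afterwards the Get() call blocks forever.

package main

import (
    "fmt"
    "sync"
    "time"
)

// Simplified stand-in for the config-manager's SyncableConfig: Set() broadcasts
// on every update, and Get() waits whenever the value has not changed since the
// last read.
type syncableConfig struct {
    cond     *sync.Cond
    mutex    sync.Mutex
    current  string
    lastRead string
}

func newSyncableConfig() *syncableConfig {
    s := &syncableConfig{}
    s.cond = sync.NewCond(&s.mutex)
    return s
}

func (s *syncableConfig) Set(value string) {
    s.mutex.Lock()
    defer s.mutex.Unlock()
    s.current = value
    s.cond.Broadcast() // if this runs before the first Get(), nobody is waiting yet
}

func (s *syncableConfig) Get() string {
    s.mutex.Lock()
    defer s.mutex.Unlock()
    if s.lastRead == s.current {
        s.cond.Wait() // with the label unset, nothing ever wakes this up
    }
    s.lastRead = s.current
    return s.lastRead
}

func main() {
    s := newSyncableConfig()
    s.Set("") // the informer's AddFunc fires first: label unset, value is ""

    done := make(chan string, 1)
    go func() { done <- s.Get() }()

    select {
    case v := <-done:
        fmt.Printf("Get() returned %q\n", v)
    case <-time.After(2 * time.Second):
        fmt.Println("Get() is still blocked after 2s: deadlock reproduced")
    }
}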

Solution

Added an initialized boolean flag to SyncableConfig to track whether Get() has been called at least once. The first Get() call now returns immediately with the current value, avoiding the deadlock. Subsequent Get() calls continue to wait properly when the value hasn't changed.
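
A sketch of what the changed Get() path looks like under that description (field and method names follow the summary above; treat it as an illustration rather than the exact diff):

type SyncableConfig struct {
    cond        *sync.Cond
    mutex       sync.Mutex
    current     string
    lastRead    string
    initialized bool // set once Get() has been called at least once
}

func (m *SyncableConfig) Get() string {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    // First call: return the current value (possibly "") immediately, since the
    // broadcast for the initial value may already have fired before any caller
    // started waiting. Later calls keep the original behavior and block until
    // the value actually changes.
    if !m.initialized {
        m.initialized = true
    } else if m.lastRead == m.current {
        m.cond.Wait()
    }
    m.lastRead = m.current
    return m.lastRead
}

Set() stays as-is in this sketch; only the first read becomes non-blocking.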

Fixes #1540


copy-pr-bot bot commented Nov 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

When the node label (nvidia.com/device-plugin.config) is not set, a race
condition could cause the config-manager to hang indefinitely on startup.

The issue occurred when the informer's AddFunc fired before the first Get()
call, setting current="" and broadcasting. When Get() was subsequently called,
it found lastRead == current (both empty strings) and waited forever, as no
future events would wake it up.

This fix adds an 'initialized' flag to SyncableConfig to ensure the first
Get() call never waits, regardless of timing. Subsequent Get() calls still
wait properly when the value hasn't changed.

Signed-off-by: Uri Sternik <uri.sternik@wiz.io>
uristernik force-pushed the fix-config-manager-race-condition branch from 0e9c0ac to 1a931be on November 30, 2025 at 13:46
Member

elezar commented Dec 1, 2025

@jgehrcke do we do something similar in the k8s-dra-driver? If so, how do we handle the initial synchronization there?

m.mutex.Lock()
defer m.mutex.Unlock()
m.current = value
m.cond.Broadcast()
Member


In the mig-manager we have a conditional broadcast (https://github.com/NVIDIA/mig-parted/blob/a56cd122e899778996f7488012058056100d091f/cmd/nvidia-mig-manager/main.go#L101-L103). Would that address the race that you are trying to fix here, or does the mig-manager suffer from the same possible deadlock?
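
For reference, the conditional broadcast being pointed at looks roughly like this (paraphrased from the linked mig-parted code, not a verbatim quote):

func (m *SyncableMigConfig) Set(value string) {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    m.current = value
    if m.current != "" {
        m.cond.Broadcast() // only broadcast for non-empty values
    }
}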

Author


I think it will still suffer from the same possible deadlock. If I understand the scenario correctly, the deadlock happens when the "first" broadcast fires before the other goroutine starts waiting for it. The condition here would skip the broadcast, but in the described scenario nobody is listening for that broadcast anyway.

When it happens, the only ways to release the waiting goroutine are:

  1. labeling the node with some dummy config (and then deleting the label)
  2. killing the pod

Author

Hey @elezar, did you get a chance to have a look?

Author

Gentle ping here @elezar: this is happening quite a lot and requires manual intervention each time.

Member

elezar commented Dec 5, 2025

@uristernik we are reviewing. Thanks for your patience.

Signed-off-by: Uri Sternik <uri.sternik@wiz.io>
uristernik force-pushed the fix-config-manager-race-condition branch from fdfff08 to 403778c on December 7, 2025 at 08:13

jojimt left a comment


LGTM

Author

Quick update here @elezar @jgehrcke: I am running a forked version with this commit and we are not seeing the issue reproduce. We are running anywhere between 200 and 550 GPU nodes at any given time.
[Screenshot: CleanShot 2025-12-14 at 08 37 08@2x]

Successfully merging this pull request may close these issues: Init container hangs waiting for device-plugin.config label