You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Fixes#17018 potential deadlock in LocalGridModel by firing events outside the lock scope
Fix deadlock between health check and purge dead nodes threads by moving event firing outside of lock scope in LocalGridModel
Events (NodeRestartedEvent, NodeRemovedEvent) were previously fired while holding LocalGridModel.lock, causing circular lock dependency with LocalNodeRegistry.lock
Write a unit test to simulate (based on analysis & thread dump in #17018)
Thread A (node-health-check):
1. LocalNodeRegistry.updateNodeAvailability() acquires LocalNodeRegistry.lock
2. Calls model.setAvailability() → waits for LocalGridModel.lock
Thread B (Purge Dead Nodes):
1. model.purgeDeadNodes() acquires LocalGridModel.lock
2. Fires NodeRemovedEvent inside lock
3. Event listener calls LocalNodeRegistry.remove() → waits for LocalNodeRegistry.lock
This creates a circular wait condition → deadlock.
Fixed by updating LocalGridModel to fire events outside the lock scope:
purgeDeadNodes(): Collect removed nodes in a set, release lock, then fire NodeRemovedEvent for each
add(): Store restarted node reference, release lock, then fire NodeRestartedEvent if needed
🔧 Implementation Notes
💡 Additional Considerations
🔄 Types of changes
Cleanup (formatting, renaming)
Bug fix (backwards compatible)
New feature (non-breaking change which adds functionality and tests!)
Breaking change (fix or feature that would cause existing functionality to change)
Generic: Robust Error Handling and Edge Case Management
Objective: Ensure comprehensive error handling that provides meaningful context and graceful degradation
Status: Swallowed exceptions: The new test code catches broad Exception and ignores it (including interruption), which can silently hide failures and reduce debuggability.
Improve the purgeDeadNodesShouldFireEventsOutsideLock test by attempting to acquire a write lock instead of a read lock in the event listener. This provides a more accurate verification that the lock is released before the event is fired, as the current read lock check can pass even if the write lock is held.
// When event fires, try to call another model method that needs the lock
events.addListener(
NodeRemovedEvent.listener(
status -> {
try {
- // This needs the read lock - if event is fired inside write lock,- // this would work due to lock downgrading, but we verify the pattern- model.getSnapshot();+ // This needs the write lock. If the event is fired inside the+ // purgeDeadNodes's write lock, this will block and time out.+ // We use a dummy node ID since we just need to acquire the lock.+ model.setAvailability(new NodeId(UUID.randomUUID()), UP);
canAcquireLock.set(true);
} catch (Exception e) {
canAcquireLock.set(false);
}
eventFired.countDown();
}));
Apply / Chat
Suggestion importance[1-10]: 8
__
Why: The suggestion correctly identifies a flaw in the test logic where lock downgrading in ReentrantReadWriteLock could lead to a false positive, and proposes a more robust check by attempting to acquire a write lock, significantly improving the test's reliability.
Medium
Catch exceptions during removal events
Wrap the events.fire(new NodeRemovedEvent(node)) call within a try-catch block inside the loop. This prevents a single failing event listener from stopping the notification process for other removed nodes.
Why: The suggestion correctly points out that an exception in an event listener could halt the processing of subsequent node removal events, and proposes a robust solution by wrapping each event fire in a try-catch block.
Medium
Preserve event firing order
Change removedNodes from a HashSet to an ArrayList to ensure that NodeRemovedEvents are fired in a deterministic order.
-Set<NodeStatus> removedNodes = new HashSet<>();+List<NodeStatus> removedNodes = new ArrayList<>();
Apply / Chat
Suggestion importance[1-10]: 2
__
Why: The suggestion to use a List for deterministic event order is valid, but the current implementation's use of a Set is also correct and has no functional downside, as the order of NodeRemovedEvents is not critical.
Low
Learned best practice
Don’t ignore thread failures
Capture any thrown exceptions and restore the interrupt flag on interruption, then assert no exceptions occurred so the test cannot pass while threads silently failed.
Why:
Relevant best practice - Do not swallow exceptions/interrupts in concurrency tests; propagate failures and restore interrupt status to avoid false positives.
Low
Clean up executors reliably
Move shutdown into a finally block and await termination (best-effort) so threads are reliably cleaned up even when assertions fail or time out.
Why:
Relevant best practice - Ensure executor lifecycle is safely managed in tests (shutdown in finally and await termination) to prevent thread leaks and flakiness.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🔗 Related Issues
💥 What does this PR do?
Fixes #17018 potential deadlock in LocalGridModel by firing events outside the lock scope
LocalGridModelNodeRestartedEvent,NodeRemovedEvent) were previously fired while holdingLocalGridModel.lock, causing circular lock dependency withLocalNodeRegistry.lockWrite a unit test to simulate (based on analysis & thread dump in #17018)
Fixed by updating LocalGridModel to fire events outside the lock scope:
purgeDeadNodes(): Collect removed nodes in a set, release lock, then fireNodeRemovedEventfor eachadd(): Store restarted node reference, release lock, then fireNodeRestartedEventif needed🔧 Implementation Notes
💡 Additional Considerations
🔄 Types of changes