libn/networkdb: stop table events from racing network leaves

corhere · corhere · commit 270a4d41dc1a · 2025-05-15T12:57:37.000-04:00
When a node leaves a network or the cluster, or memberlist considers the
node as failed, NetworkDB atomically deletes all table entries (for the
left network) owned by the node. This maintains the invariant that table
entries owned by a node are present in the local database indices iff
that node is an active cluster member which is participating in the
network the entries pertain to.

(*NetworkDB).handleTableEvent() is written in a way which attempts to
minimize the amount of time it is in a critical section with the mutex
locked for writing. It first checks under a read-lock whether both the
local node and the node where the event originated are participating in
the network which the event pertains to. If the check passes, the mutex
is unlocked for reading and locked for writing so the local database
state is mutated in a critical section. That leaves a window of time
between the participation check the write-lock being acquired for a
network or node event to arrive and be processed. If a table event for a
node+network races a node or network event which triggers the purge of
all table entries for the same node+network, the invariant could be
violated. The table entry described by the table event may be reinserted
into the local database state after being purged by the node's leaving,
resulting in an orphaned table entry which the local node will bulk-sync
to other nodes indefinitely.

It's not completely wrong to perform a pre-flight check outside of the
critical section. It allows for an early return in the no-op case
without having to bear the cost of synchronization. But such optimistic
concurrency control is only sound if the condition is double-checked
inside the critical section. It is tricky to get right, and this
instance of optimistic concurrency control smells like a case of
premature optimization. Move the pre-flight check into the critical
section to ensure that the invariant is maintained.

Signed-off-by: Cory Snider &lt;csnider@mirantis.com&gt;
diff --git a/libnetwork/networkdb/delegate.go b/libnetwork/networkdb/delegate.go
@@ -148,8 +148,13 @@ func (nDB *NetworkDB) handleTableEvent(tEvent *TableEvent, isBulkSync bool) bool
 	// Update our local clock if the received messages has newer time.
 	nDB.tableClock.Witness(tEvent.LTime)
 
+	nDB.Lock()
+	// Hold the lock until after we broadcast the event to watchers so that
+	// the new watch receives either the synthesized event or the event we
+	// broadcast, never both.
+	defer nDB.Unlock()
+
 	// Ignore the table events for networks that are in the process of going away
-	nDB.RLock()
 	networks := nDB.networks[nDB.config.NodeID]
 	network, ok := networks[tEvent.NetworkID]
 	// Check if the owner of the event is still part of the network
@@ -161,18 +166,12 @@ func (nDB *NetworkDB) handleTableEvent(tEvent *TableEvent, isBulkSync bool) bool
 			break
 		}
 	}
-	nDB.RUnlock()
 
 	if !ok || network.leaving || !nodePresent {
 		// I'm out of the network OR the event owner is not anymore part of the network so do not propagate
 		return false
 	}
 
-	nDB.Lock()
-	// Hold the lock until after we broadcast the event to watchers so that
-	// the new watch receives either the synthesized event or the event we
-	// broadcast, never both.
-	defer nDB.Unlock()
 	var entryPresent bool
 	prev, err := nDB.getEntry(tEvent.TableName, tEvent.NetworkID, tEvent.Key)
 	if err == nil {