
libnetwork, libn/d/overlay: handle coalesced updates from NetworkDB #50393

Merged
corhere merged 6 commits into moby:master from corhere:libn/handle-coalesced-updates
Jul 23, 2025

Conversation

@corhere (Contributor) commented Jul 11, 2025

- What I did

- How I did it
The eventually-consistent nature of NetworkDB means we cannot depend on events being received in the same order that they were sent. Nor can we depend on receiving events for all intermediate states. It is possible for a series of entry UPDATEs, or a DELETE followed by a CREATE with the same key, to get coalesced into a single UPDATE event on the receiving node. Watchers of NetworkDB tables therefore need to be prepared to gracefully handle arbitrary UPDATEs of a key, including those where the new value may have nothing in common with the previous value.

Both the overlay driver's watcher for overlay_peer_table and the libnetwork controller's watcher for endpoint_table would misbehave when processing a coalesced UPDATE event. They would handle it the same as a CREATE: insert new entries into control-plane databases without removing the old. (This is a simplified explanation.) The result would be "transient states" with conflicts that never resolve, leaving obsolete entries in kernel control-plane databases and service-discovery DNS. Fix the watchers to remove the old before adding the new when handling UPDATE events.
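The remove-before-add fix can be sketched as follows. This is a minimal toy model, not the actual moby code: peerDB here is a bare map keyed by peer IP, standing in for the kernel FDB/neighbor state the overlay driver programs.

```go
package main

import "fmt"

// peerDB maps a peer's overlay IP to the node hosting it, a stand-in
// for the kernel control-plane state programmed per peer.
type peerDB map[string]string

// handleUpdate processes a (possibly coalesced) UPDATE event: the entry
// may have been deleted and re-created with a different IP, so the old
// value is removed before the new one is added. Treating the UPDATE as
// a plain CREATE (skipping the delete) would leave the stale IP behind.
func (db peerDB) handleUpdate(prevIP, newIP, node string) {
	if prevIP != "" && prevIP != newIP {
		delete(db, prevIP) // drop the obsolete entry first
	}
	db[newIP] = node
}

func main() {
	db := peerDB{"10.0.0.5": "worker1"}
	// A DELETE followed by a CREATE arrives coalesced as one UPDATE
	// carrying a new IP for the same endpoint.
	db.handleUpdate("10.0.0.5", "10.0.0.9", "worker1")
	fmt.Println(len(db), db["10.0.0.9"]) // 1 worker1
}
```

Without the delete, the database would keep both 10.0.0.5 and 10.0.0.9, which is exactly the never-resolving conflict described above.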

- How to verify it

  1. Set up a three-node Swarm with two workers and one manager
  2. Create a user-defined, attachable overlay network
  3. On the worker nodes, set up a simulated lossy underlay network using the tc netem scheduler. E.g.: tc qdisc add dev eth0 root netem delay 100ms 200ms loss 1%
  4. Verify that service-discovery DNS stays in sync:
    1. Deploy a globally-replicated service with constraint node.role!=manager which is connected to the user-defined overlay network from step 2. Arrange it so the tasks exit after a few seconds, e.g. alpine sh -c 'sleep 5'
    2. Run a basic container on the manager node, connected to the overlay network from step 2. Repeatedly make DNS queries for tasks.<service from step 4> while the service tasks keep flapping in the background. Verify that the set of IP addresses in the answer does not grow unbounded.
    3. Verify that the daemon logs on all three nodes show no evidence of persistent transient states
    4. Delete the basic container and service
  5. Verify that service-discovery DNS and overlay FDB do not fall out of sync when the same endpoint ID is repeatedly removed and re-added:
    1. On the manager node, docker run -d --name hellobasic --net my-overlay nginxdemos/hello:plain-text
    2. On one of the worker nodes, docker run --rm --net my-overlay alpine watch wget -qO- hellobasic
    3. Back on the manager, repeatedly run for run in {1..3}; do docker kill hellobasic && docker start hellobasic; done so that the endpoint flaps and gets reattached with different IP and MAC addresses.
    4. Verify that the worker node's container is eventually able to communicate with the hellobasic container after each restart.
    5. Delete the containers
  6. Verify that the ingress routing mesh is reliable with flapping tasks:
    1. docker service create -d --name hello --network my-overlay --mode global --constraint 'node.role!=manager' -p 30888:80 nginxdemos/hello:plain-text
    2. curl port 30888 on each node, repeatedly. Verify that repeated requests all succeed, and hit different tasks.
    3. for run in {1..10}; do docker service update --force hello; sleep 15; done
    4. Repeat step 2.

- Human readable description for the release notes

- Greatly improve the reliability of overlay networking and the Swarm routing mesh

- A picture of a cute animal (not mandatory but encouraged)

Copilot AI left a comment


Pull Request Overview

This PR replaces the legacy Create/Update/Delete event types in NetworkDB with a unified WatchEvent that carries both previous and current values, and updates all watchers and driver interfaces to handle coalesced updates correctly.

  • Introduce WatchEvent with Prev/Value fields and helper methods.
  • Update all broadcaster and watcher code (NetworkDB, drivers, agent) to emit and consume WatchEvent.
  • Remove old stub EventNotify/DecodeTableEntry implementations and adapt tests to use WatchEvent.
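Based on the overview above, the new event shape can be sketched roughly like this. Field and helper names follow the PR summary; the exact types in the real libnetwork/networkdb package may differ.

```go
package main

import "fmt"

// WatchEvent carries both the previous and current value of a table
// entry. The operation kind is implied by which field is nil, rather
// than being encoded in the type system.
type WatchEvent struct {
	Table     string
	NetworkID string
	Key       string
	Value     []byte // nil if the entry was deleted
	Prev      []byte // nil if the entry was created
}

// Helper methods classify the event by the nilness of Prev/Value.
func (e WatchEvent) IsCreate() bool { return e.Prev == nil && e.Value != nil }
func (e WatchEvent) IsDelete() bool { return e.Prev != nil && e.Value == nil }
func (e WatchEvent) IsUpdate() bool { return e.Prev != nil && e.Value != nil }

func main() {
	ev := WatchEvent{Key: "ep1", Prev: []byte("old"), Value: []byte("new")}
	fmt.Println(ev.IsUpdate()) // true
}
```

A coalesced DELETE+CREATE simply arrives as an event where both Prev and Value are non-nil and unrelated, which is what forces watchers to handle arbitrary updates.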

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 3 comments.

Summary per file:

    File                                      Description
    libnetwork/networkdb/watch.go             Define WatchEvent, update Watch filtering
    libnetwork/networkdb/*.go                 Refactor broadcasters and handlers to use WatchEvent
    libnetwork/driverapi/driverapi.go         Add TableWatcher interface for coalesced events
    libnetwork/drivers/...                    Remove old stubs, implement new EventNotify
    libnetwork/networkdb/tableevent_test.go   Update tests to expect WatchEvent
    libnetwork/networkdb/networkdb_test.go    Adapt testWatch helper to validate Prev/Value
    libnetwork/agent.go                       Integrate TableWatcher events in Services and handlers
    internal/iterutil/iterutil.go             Add SameValues/Deref utilities
    internal/iterutil/iterutil_test.go        New tests for SameValues/Deref
Comments suppressed due to low confidence (1)

    case DeleteEvent:
        evt = event(ev)
    }
    evt := ev.(WatchEvent)

Copilot AI Jul 11, 2025


The unguarded type assertion ev.(WatchEvent) can panic if a non-WatchEvent is received; consider using a safe assertion (e, ok := ev.(WatchEvent)) or a type switch to prevent runtime panics.

Suggested change:

    -evt := ev.(WatchEvent)
    +evt, ok := ev.(WatchEvent)
    +if !ok {
    +    return false
    +}

corhere (Contributor, Author) replied:

This is intentional. I want it to fail fast and loudly if events of the wrong type are received, as that would be a bug.

Comment thread: libnetwork/agent.go

    } else {
        ep.info = info
        eps[epID] = ep
        if d, ok := d.(driverapi.TableWatcher); ok {

This comment was marked as outdated.

    func TestDeref(t *testing.T) {
        a := make([]*int, 3)
        for i := range a {
            a[i] = &i

This comment was marked as outdated.

@thaJeztah (Member) commented:

Looks like there's a compile failure on Windows at least;

    42.33 # github.com/docker/docker/libnetwork/drivers/windows
    42.33 libnetwork/drivers/windows/windows.go:308:46: undefined: driverapi.EventType
    60.13 + rm -f /go/src/github.com/docker/docker/go.mod

@corhere force-pushed the libn/handle-coalesced-updates branch 4 times, most recently from e8c508f to 739e533 on July 15, 2025 at 22:05
@robmry (Contributor) left a comment:

LGTM

Comment thread: daemon/libnetwork/agent.go (Outdated)

    -switch ev.(type) {
    -case networkdb.CreateEvent, networkdb.UpdateEvent:
    +switch {
    +case event.Value != nil:
robmry (Contributor):

Using the event.IsX methods might make this clearer?

corhere (Contributor, Author) replied:

Yes, in retrospect I agree. I'll save this sort of construct for the following commit where the implementation gets corrected.

@austinvazquez (Contributor) left a comment:
The networking stack is new to me, but 👍🏼 from a Go/delta POV. No blockers from me.

Comment on lines +14 to +15:

    _ map[MACAddr]bool
    _ map[IPMAC]bool
Contributor:
👨🏼‍🍳

@akerouanton (Member) left a comment:

In your 2nd commit, introducing TableWatcher, you added an interface assertion to the Linux overlay driver, but not to the Windows overlay driver. Is this on purpose?

Otherwise LGTM

@corhere (Contributor, Author) commented Jul 22, 2025:

In your 2nd commit, introducing TableWatcher, you added an interface assertion to the Linux overlay driver, but not to the Windows overlay driver. Is this on purpose?

Nope, that's an oversight. Nice catch!

corhere added 6 commits July 22, 2025 11:49
When handling updates to existing entries, it is often necessary to know
what the previous value was. NetworkDB knows the previous and new values
when it broadcasts an update event for an entry. Include both values in
the update event so the watchers do not have to do their own parallel
bookkeeping.

Unify the event types under WatchEvent, since representing the operation
kind in the type system has proven inconvenient rather than useful. The
operation is now implied by the nilness of the Value and Prev event
fields.

Signed-off-by: Cory Snider <[email protected]>
Overlay is the only driver which makes use of the EventNotify facility,
yet all other driver implementations are forced to provide a stub
implementation. Move the EventNotify and DecodeTableEntry methods into a
new optional TableWatcher interface and remove the stubs from all the
other drivers.

Signed-off-by: Cory Snider <[email protected]>
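The optional-interface pattern this commit describes can be illustrated with a simplified sketch. The interface and method signatures here are approximations for illustration, not the real driverapi definitions.

```go
package main

import "fmt"

// Driver is the mandatory interface every network driver implements.
type Driver interface{ Type() string }

// TableWatcher is the optional interface: only drivers that actually
// consume NetworkDB table events (i.e. overlay) implement it, so the
// other drivers no longer need stub EventNotify methods.
type TableWatcher interface {
	EventNotify(table, key string, prev, value []byte)
}

type overlayDriver struct{}

func (overlayDriver) Type() string { return "overlay" }
func (overlayDriver) EventNotify(table, key string, prev, value []byte) {
	fmt.Printf("overlay: %s/%s updated\n", table, key)
}

type bridgeDriver struct{}

func (bridgeDriver) Type() string { return "bridge" }

// dispatch feature-detects the optional interface with a type
// assertion; drivers without it are simply skipped.
func dispatch(d Driver, table, key string, prev, value []byte) bool {
	if tw, ok := d.(TableWatcher); ok {
		tw.EventNotify(table, key, prev, value)
		return true
	}
	return false
}

func main() {
	fmt.Println(dispatch(overlayDriver{}, "overlay_peer_table", "ep1", nil, []byte("v"))) // true
	fmt.Println(dispatch(bridgeDriver{}, "overlay_peer_table", "ep1", nil, []byte("v")))  // false
}
```

This is the same feature-detection construct visible in the agent.go snippet quoted earlier in the thread (`if d, ok := d.(driverapi.TableWatcher); ok`).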
The macAddr and ipmac types are generally useful within libnetwork. Move
them to a dedicated package and overhaul the API to be more like that of
the net/netip package.

Update the overlay driver to utilize these types, adapting to the new
API.

Signed-off-by: Cory Snider <[email protected]>
Windows and Linux overlay driver instances are interoperable, working
from the same NetworkDB table for peer discovery. As both drivers
produce and consume serialized data through the table, they both need to
have a shared understanding of the shape and semantics of that data.
The Windows overlay driver contains a duplicate copy of the protobuf
definitions used for marshaling and unmarshaling the NetworkDB peer
entries for dubious reasons. It gives us the flexibility to have the
definitions diverge, which is only really useful for shooting ourselves
in the foot.

Make daemon/libnetwork/drivers/overlay the source of truth for the peer
record definitions and the name of the NetworkDB table for distributing
peer records.

Signed-off-by: Cory Snider <[email protected]>
The eventually-consistent nature of NetworkDB means we cannot depend on
events being received in the same order that they were sent. Nor can we
depend on receiving events for all intermediate states. It is possible
for a series of entry UPDATEs, or a DELETE followed by a CREATE with the
same key, to get coalesced into a single UPDATE event on the receiving
node. Watchers of NetworkDB tables therefore need to be prepared to
gracefully handle arbitrary UPDATEs of a key, including those where the
new value may have nothing in common with the previous value.

The overlay driver naively handled events for overlay_peer_table
assuming that an endpoint leave followed by a rejoin of the same
endpoint would always be expressed as a DELETE event followed by a
CREATE. It would handle a coalesced UPDATE as a CREATE, inserting a new
entry into peerDB without removing the old one. This would have various
side effects, such as the "transient state" of multiple peerDB entries
with the same peer IP never settling.

Update driverapi to pass both the previous and new value of a table
entry into the driver. Modify the overlay driver to handle an UPDATE by
removing the previous peer entry from peerDB then adding the new one.
Modify the Windows overlay driver to match.

Signed-off-by: Cory Snider <[email protected]>
The eventually-consistent nature of NetworkDB means we cannot depend on
events being received in the same order that they were sent. Nor can we
depend on receiving events for all intermediate states. It is possible
for a series of entry UPDATEs, or a DELETE followed by a CREATE with the
same key, to get coalesced into a single UPDATE event on the receiving
node. Watchers of NetworkDB tables therefore need to be prepared to
gracefully handle arbitrary UPDATEs of a key, including those where the
new value may have nothing in common with the previous value.

The libnetwork controller naively handled events for endpoint_table
assuming that an endpoint leave followed by a rejoin of the same
endpoint would always be expressed as a DELETE event followed by a
CREATE. It would handle a coalesced UPDATE as a CREATE, adding a new
service binding without removing the old one. This would have various
side effects, such as the "transient state" of multiple conflicting
service bindings, where more than one endpoint is assigned the same IP
address, never settling.

Modify the libnetwork controller to handle an UPDATE by removing the
previous service binding then adding the new one.

Signed-off-by: Cory Snider <[email protected]>
@corhere force-pushed the libn/handle-coalesced-updates branch from f147ed8 to 4538a1d on July 22, 2025 at 15:51
@corhere corhere merged commit 3c8ba15 into moby:master Jul 23, 2025
175 of 176 checks passed
@corhere corhere deleted the libn/handle-coalesced-updates branch July 23, 2025 18:00

Development

Successfully merging this pull request may close these issues.

- FDB for overlay network link falls out of sync with NetworkDB state
- dockerd produces 180GB syslog file in one night

8 participants