Skip to content

Fix live restore for IPv6-only and multiple gateway endpoints#49150

Merged
robmry merged 5 commits intomoby:masterfrom
robmry:live_restore_fixes
Jan 6, 2025
Merged

Fix live restore for IPv6-only and multiple gateway endpoints#49150
robmry merged 5 commits intomoby:masterfrom
robmry:live_restore_fixes

Conversation

@robmry
Copy link
Copy Markdown
Contributor

@robmry robmry commented Dec 19, 2024

- What I did

On live-restore, the Sandbox tries to restore state in the osSbox by telling it about interface, routes, and gateways that would have been set up by the previous incarnation of the daemon.

Restoring gateways has been broken since commit 1832774 (Allow separate IPv4/IPv6 gateway endpoints) ... which didn't properly deal with searching for the "dstName" of an interface based on its IPv6 address.

- How I did it

Split the osSbox restore into three parts:

  • Restore the interfaces, including finding the "dstName".
  • Restore routes, unchanged, they're just a copy of the sandbox's StaticRoutes
  • Restore gateway info ...
    • If the Sandbox's gateway endpoint has an IP address (v4 or v6, depending on which addr family/families it's acting as the gateway for), remember that.
    • Otherwise, the default route is bound to the interface, so remember that.

- How to verify it

Manually ...

  • Enabled live restore.
  • Started a container connected to a bridge network and an ipvlan-l3...
    • docker network create -d ipvlan -o parent=eth0 -o ipvlan_mode=l3 --ipv6 ipv
    • docker network create --ipv6 b46
    • docker run --rm -d --network name=ipv,gw-priority=1 --network b46 --name c1 alpine sleep infinity
  • Restarted the daemon.
    • connected and disconnected the ipvlan network, checked that the default route was updated properly.

Also, repeat test TestMixL3IPVlanAndBridge with a live-restore-enabled daemon restart before running the network connects/disconnects - and add an IPv6-only network. Without the fixes in this PR, gateways/default routes could not be removed when an Endpoint leaves the Sandbox, and the IPv6-only network couldn't become a gateway because the dstName of its interface hadn't been restored correctly.

- Description for the changelog

@robmry robmry self-assigned this Dec 19, 2024
@robmry robmry added this to the 28.0.0 milestone Dec 19, 2024
@robmry robmry added area/networking Networking kind/bugfix PR's that fix bugs labels Dec 19, 2024
@robmry robmry marked this pull request as ready for review December 20, 2024 14:14
@robmry robmry marked this pull request as draft December 20, 2024 14:20
@robmry robmry force-pushed the live_restore_fixes branch from 1f9eeee to fb0b8b0 Compare January 6, 2025 16:38
@robmry robmry marked this pull request as ready for review January 6, 2025 16:39
@robmry robmry requested a review from akerouanton January 6, 2025 16:39
Copy link
Copy Markdown
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

})
assert.Check(t, defFound, "default route %q not found:\n%s", expDefRoute, strings.Join(routes, "\n"))
if expDefRoute == "" {
defFound := slices.ContainsFunc(routes, func(s string) bool {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it didn't complain before this, but probably because it's a _test.go - wondering if we should technically have such //go:build tags for these as well

(no need to do in this PR; looks like it was already there)

Copy link
Copy Markdown
Member

@akerouanton akerouanton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small nit, otherwise LGTM.

id: sbs.ID,
controller: sbs.c,
containerID: sbs.Cid,
epPriority: sbs.EpPriority,
Copy link
Copy Markdown
Member

@akerouanton akerouanton Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this method is badly named -- it does more than cleaning up old sandboxes. Maybe we could come up with a more explicit name, or add a comment stating what it's used for? (either in this PR if you have to update it, or in a follow-up)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to sandboxRestore, with comment "sandboxRestore restores Sandbox objects from the store, deleting them if they're not active." ... I think that covers it?

Comment thread integration/network/network_linux_test.go
robmry added 5 commits January 6, 2025 18:11
Signed-off-by: Rob Murray <[email protected]>
On live-restore, the Sandbox tries to restore state in the osSbox
by telling it about interface, routes, and gateways that would
have been set up by the previous incarnation of the daemon.

Restoring gateways has been broken since commit 1832774 (Allow
separate IPv4/IPv6 gateway endpoints.) ... which didn't properly
deal with searching for the "dstName" of an interface based on its
IPv6 address.

This commit fixes that, and splits the osSbox restore into three
parts:
- Restore the interfaces, including finding the "dstName".
- Restore routes, unchanged, they're just a copy of the sandbox's
  StaticRoutes
- Restore gateway info - if the Sandbox's gateway endpoint has an
  IP address (v4 or v6, depending on which addr family/families it's
  acting as the gateway for), store that. If not, the default route
  is bound to the interface, so store that.

Signed-off-by: Rob Murray <[email protected]>
@robmry robmry force-pushed the live_restore_fixes branch from fb0b8b0 to 39c0517 Compare January 6, 2025 18:16
@robmry
Copy link
Copy Markdown
Contributor Author

robmry commented Jan 6, 2025

Unrelated failure in CI ...

=== FAIL: amd64.integration.container TestWaitRestartedContainer/next-exit (5.76s)
    wait_test.go:232: assertion failed: 5 (expectedCode int64) != 137 (waitRes.StatusCode int64)
    --- FAIL: TestWaitRestartedContainer/next-exit (5.76s)

@robmry robmry merged commit 9e3e429 into moby:master Jan 6, 2025
@robmry robmry deleted the live_restore_fixes branch February 26, 2025 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking Networking kind/bugfix PR's that fix bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants