Skip to content

vendor: github.com/vishvananda/netlink v1.3.0#48368

Merged
thaJeztah merged 2 commits intomoby:masterfrom
thaJeztah:vendor_netlink_1.2.1
Aug 28, 2024
Merged

vendor: github.com/vishvananda/netlink v1.3.0#48368
thaJeztah merged 2 commits intomoby:masterfrom
thaJeztah:vendor_netlink_1.2.1

Conversation

@thaJeztah
Copy link
Member

@thaJeztah thaJeztah commented Aug 23, 2024

vendor: github.com/vishvananda/netlink v1.2.1

full diff: vishvananda/netlink@v1.2.1-beta.2...v1.2.1

vendor: github.com/vishvananda/netlink v1.3.0

full diff: vishvananda/netlink@v1.2.1...v1.3.0

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@thaJeztah
Copy link
Member Author

Hm... 🤔

73.74 # github.com/docker/docker/libnetwork/iptables
73.74 libnetwork/iptables/conntrack.go:78:28: nlh.ConntrackDeleteFilter undefined (type *netlink.Handle has no field or method ConntrackDeleteFilter)
73.74 libnetwork/iptables/conntrack.go:84:28: nlh.ConntrackDeleteFilter undefined (type *netlink.Handle has no field or method ConntrackDeleteFilter)
73.74 libnetwork/iptables/conntrack.go:105:13: nlh.ConntrackDeleteFilter undefined (type *netlink.Handle has no field or method ConntrackDeleteFilter)

@thaJeztah
Copy link
Member Author

Looks like this issue where things broke; vishvananda/netlink@3b7e16c

That commit renamed ConntrackDeleteFilter to ConntrackDeleteFilters (plural) and changed the signature to accept multiple filters.

@thaJeztah
Copy link
Member Author

Hmm some failures on arm64 (jenkins); could be related?

=== RUN   TestSandboxAddMultiPrio
    sandbox_unix_test.go:160: failed to add interface veth00dcf18 to sandbox: error setting interface "veth00dcf18" IP to 172.18.0.2/16: cannot program address 172.18.0.2/16 in sandbox interface because it conflicts with existing route {Ifindex: 5 Dst: 0.0.0.0/0 Src: <nil> Gw: 172.17.0.1 Flags: [] Table: 254 Realm: 0}
--- FAIL: TestSandboxAddMultiPrio (0.30s)
TestSandboxAddSamePrio – github.com/docker/docker/li

@thaJeztah
Copy link
Member Author

Ugh; lots of failures. I think there's something borked in the module 😔

@akerouanton
Copy link
Member

sandbox_unix_test.go:160: failed to add interface veth00dcf18 to sandbox: error setting interface "veth00dcf18" IP to 172.18.0.2/16: cannot program address 172.18.0.2/16 in sandbox interface because it conflicts with existing route {Ifindex: 5 Dst: 0.0.0.0/0 Src: Gw: 172.17.0.1 Flags: [] Table: 254 Realm: 0}

This error message comes from here:

func checkRouteConflict(nlh *netlink.Handle, address *net.IPNet, family int) error {
routes, err := nlh.RouteList(nil, family)
if err != nil {
return err
}
for _, route := range routes {
if route.Dst != nil {
if route.Dst.Contains(address.IP) || address.Contains(route.Dst.IP) {
return fmt.Errorf("cannot program address %v in sandbox interface because it conflicts with existing route %s",
address, route)
}
}
}
return nil
}

So it barfs out because there's a default route (ie. Dst = 0.0.0.0/0). Of course, 0.0.0.0/0 contains 172.18.0.2. In the test, we join two endpoints to the same sandbox, and it fails for the 2nd endpoint:

if err := ep1.Join(context.Background(), sbx, JoinOptionPriority(1)); err != nil {
t.Fatal(err)
}
if err := ep2.Join(context.Background(), sbx, JoinOptionPriority(2)); err != nil {
t.Fatal(err)
}

So either RouteList wasn't returning default routes beforehand, or a default route is now created when the 1st endpoint joins the sandbox. I'd need to dig more but don't have much bandwidth right now (maybe next week).

@thaJeztah
Copy link
Member Author

Thanks for looking @akerouanton ! Yeah, we need to find out what's happening and if it was an intentional change.

Unfortunately the diff since the last beta is quite large 😬

At least let me /cc @aboch 👋 in case he would know, and "FYI" that something could be off

@thaJeztah
Copy link
Member Author

@akerouanton someone gave some hints at possible suspects;

This is likely because of this change: acdc658 It was made a year ago but I guess there hasn't been an actual release of the netlink library for a while. You will probably need to update the moby code to check if route.Dst is "all zeros" instead of nil.

More generally, there has been quite a lot of changes to the API since the 1.2.0 release. New functions were added but there were also some "breaking" changes, most of them more subtle than just the removal of ConntrackDeleteFilter. The behavior of some API functions was changed (which is the issue you are running into), the type of some fields was changed (vishvananda/netlink#983), etc. It seems that at the very least a new minor release (1.3.0) would have been more appropriate than a patch release (1.2.1).

@thaJeztah thaJeztah force-pushed the vendor_netlink_1.2.1 branch from 13453fa to 5eff8b8 Compare August 23, 2024 18:35
@akerouanton
Copy link
Member

The error seen on Jenkins is quite concerning, but not related to this PR (https://ci-next.docker.com/public/blue/organizations/jenkins/moby/detail/PR-48368/6/pipeline/#step-70-log-620):


[2024-08-26T09:02:11.610Z] === RUN   TestAuthZPluginAllowEventStream
[2024-08-26T09:02:12.539Z] 2024/08/26 09:02:12 http: panic serving 127.0.0.1:40248: runtime error: invalid memory address or nil pointer dereference
[2024-08-26T09:02:12.539Z] goroutine 144 [running]:
[2024-08-26T09:02:12.539Z] net/http.(*conn).serve.func1()
[2024-08-26T09:02:12.539Z] 	/usr/local/go/src/net/http/server.go:1873 +0xb0
[2024-08-26T09:02:12.539Z] panic({0xa069a0?, 0x1498560?})
[2024-08-26T09:02:12.539Z] 	/usr/local/go/src/runtime/panic.go:920 +0x26c
[2024-08-26T09:02:12.539Z] github.com/docker/docker/integration/plugin/authz.setupSuite.func3({0xcdcdc8, 0x40003f0840}, 0xffff6ab1f530?)
[2024-08-26T09:02:12.539Z] 	/go/src/github.com/docker/docker/integration/plugin/authz/main_test.go:159 +0x14c
[2024-08-26T09:02:12.539Z] net/http.HandlerFunc.ServeHTTP(0x40004e6400?, {0xcdcdc8?, 0x40003f0840?}, 0x40e884?)
[2024-08-26T09:02:12.539Z] 	/usr/local/go/src/net/http/server.go:2141 +0x38
[2024-08-26T09:02:12.539Z] net/http.(*ServeMux).ServeHTTP(0xce2cb0?, {0xcdcdc8, 0x40003f0840}, 0x40004e6400)
[2024-08-26T09:02:12.539Z] 	/usr/local/go/src/net/http/server.go:2519 +0x144
[2024-08-26T09:02:12.539Z] github.com/docker/docker/vendor/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP(0x40004c5130, {0xcdb988?, 0x40004ea460}, 0x40004e6300, {0xcd2700, 0x40004dc940})
[2024-08-26T09:02:12.539Z] 	/go/src/github.com/docker/docker/vendor/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp/handler.go:229 +0xf44
[2024-08-26T09:02:12.539Z] github.com/docker/docker/vendor/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1({0xcdb988?, 0x40004ea460?}, 0x4000708570?)
[2024-08-26T09:02:12.539Z] 	/go/src/github.com/docker/docker/vendor/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp/handler.go:81 +0x40
[2024-08-26T09:02:12.539Z] net/http.HandlerFunc.ServeHTTP(0x10?, {0xcdb988?, 0x40004ea460?}, 0x40004ea460?)
[2024-08-26T09:02:12.539Z] 	/usr/local/go/src/net/http/server.go:2141 +0x38
[2024-08-26T09:02:12.539Z] net/http.serverHandler.ServeHTTP({0xcd8888?}, {0xcdb988?, 0x40004ea460?}, 0x6?)
[2024-08-26T09:02:12.539Z] 	/usr/local/go/src/net/http/server.go:2943 +0xbc
[2024-08-26T09:02:12.539Z] net/http.(*conn).serve(0x400050d0e0, {0xce2cb0, 0x4000112450})
[2024-08-26T09:02:12.539Z] 	/usr/local/go/src/net/http/server.go:2014 +0x518
[2024-08-26T09:02:12.539Z] created by net/http.(*Server).Serve in goroutine 11
[2024-08-26T09:02:12.539Z] 	/usr/local/go/src/net/http/server.go:3091 +0x4cc
[2024-08-26T09:02:13.468Z] 2024/08/26 09:02:13 http: panic serving 127.0.0.1:40244: could not unmarshal json for /AuthZPlugin.AuthZRes: unexpected end of JSON input
...

@thaJeztah
Copy link
Member Author

I think I've seen that fail before. I think this is where the panic happens;

resRes := ctrl.resRes

And looking at the ctrl variable; that looks to be a global variable in the pacakge;

ctrl *authorizationController

However, it is assigned to in setupTestV1, which sets it to nil in t.Cleanup()

func setupTestV1(t *testing.T) context.Context {
ctx := setupTest(t)
ctrl = &authorizationController{}
err := os.MkdirAll("/etc/docker/plugins", 0o755)
assert.NilError(t, err)
fileName := fmt.Sprintf("/etc/docker/plugins/%s.spec", testAuthZPlugin)
err = os.WriteFile(fileName, []byte(server.URL), 0o644)
assert.NilError(t, err)
t.Cleanup(func() {
err := os.RemoveAll("/etc/docker/plugins")
assert.NilError(t, err)
ctrl = nil
})
return ctx
}

And ... setupTestV1 is invoked in various tests .. I suspect there's a race condition there, where t.Cleanup is not completed yet when the next test runs, and therefore setting it to nil?

@thaJeztah
Copy link
Member Author

I see e8dc902 changed it from a defer to a t.Cleanup - so possibly could've introduced this? (assuming that t.Cleanup is not guaranteed to be executed before the next test starts?

@thaJeztah thaJeztah force-pushed the vendor_netlink_1.2.1 branch from 5eff8b8 to 2c498c6 Compare August 26, 2024 10:33
@thaJeztah
Copy link
Member Author

v1.3.0 was released; I pushed an extra commit to update it to that version;

vendor: github.com/vishvananda/netlink v1.3.0

full diff: vishvananda/netlink@v1.2.1...v1.3.0

@thaJeztah thaJeztah changed the title vendor: github.com/vishvananda/netlink v1.2.1 vendor: github.com/vishvananda/netlink v1.3.0 Aug 26, 2024
@thaJeztah
Copy link
Member Author

Seeing some t.Skip - looks like they're expected to fail, but curious if there's a relation with this update, so just posing "FYI"

=== Skipped
=== SKIP: amd64.integration.network TestIPRangeAt64BitLimit/ipRange_at_end_of_64-bit_subnet (0.19s)
    bridge_test.go:196: XFAIL: Container startup failed with error: Error response from daemon: no available IPv6 addresses on this network's address pools: test64bl (2c10e675625f2c85877947c358c78cbfc4a14d71948d3b9b87dbeac758450c2b)
    --- SKIP: TestIPRangeAt64BitLimit/ipRange_at_end_of_64-bit_subnet (0.19s)

=== SKIP: amd64.integration.network TestIPRangeAt64BitLimit/ipRange_at_64-bit_boundary_inside_56-bit_subnet (0.19s)
    bridge_test.go:196: XFAIL: Container startup failed with error: Error response from daemon: no available IPv6 addresses on this network's address pools: test64bl (4a1b65023efa7febf1de341dad0995a5e7078aa9a5f84d19feb73c3434e1f869)
    --- SKIP: TestIPRangeAt64BitLimit/ipRange_at_64-bit_boundary_inside_56-bit_subnet (0.19s)

@akerouanton
Copy link
Member

They're both expected to fail, but not sure if that's the exact error message we should get. cc @robmry

@robmry
Copy link
Contributor

robmry commented Aug 27, 2024

They're both expected to fail, but not sure if that's the exact error message we should get. cc @robmry

Yes, "no available IPv6 addresses on this network's address pools" is expected, but wrong - there should be an address available. ("test64bl" is just the name of the network, with its id.)

@thaJeztah
Copy link
Member Author

The error seen on Jenkins is quite concerning, but not related to this PR (https://ci-next.docker.com/public/blue/organizations/jenkins/moby/detail/PR-48368/6/pipeline/#step-70-log-620):

I think I've seen that fail before. I think this is where the panic happens;

LOL, so I went to create a ticket for that test, but turns out I already did some time ago;

@thaJeztah
Copy link
Member Author

@robmry so, this PR is good to go, or is there something to look into?

@robmry
Copy link
Contributor

robmry commented Aug 28, 2024

@robmry so, this PR is good to go, or is there something to look into?

Oh, sorry - I didn't realise you'd already fixed the nil vs. unspecified thing ... I don't know of anything else.

Copy link
Contributor

@robmry robmry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thaJeztah
Copy link
Member Author

Thanks! I'll bring this one in to also have some burn-in time in master 😄 (in case there's unexpected changes still hiding)

@thaJeztah
Copy link
Member Author

thaJeztah commented Sep 27, 2024

For discoverability: this updated introduced various regressions; some fixes on our side are in the PR below;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants