Skip to content

Retry on EINTR from netlink dump calls#48407

Merged
cpuguy83 merged 2 commits intomoby:masterfrom
robmry:48400_netlink_eintr
Sep 16, 2024
Merged

Retry on EINTR from netlink dump calls#48407
cpuguy83 merged 2 commits intomoby:masterfrom
robmry:48400_netlink_eintr

Conversation

@robmry
Copy link
Contributor

@robmry robmry commented Aug 30, 2024

- What I did

A recent change to the vishvananda/netlink package (vishvananda/netlink#925 in v1.2.1) exposes NLM_F_DUMP_INTR in some netlink responses as an EINTR error return, with no data.

This has been causing failures in CI testing.

- How I did it

Retry the requests on EINTR, up to five times, before returning the error.

The limit of five is arbitrary, on most systems a single retry will be rare but, there's no guarantee that it will succeed. So, on a very busy or misbehaving system the error may still be returned. In most cases, this will lead to failure of the operation being attempted (which may lead to daemon startup failure, network initialisation failure etc).

Also added a forbidigo linting rule to spot use of un-wrapped netlink functions.

If vishvananda/netlink#1018 is accepted and merged, we'll have the option of using the inconsistent/incomplete results - either in cases where the retry is unnecessary because some version of the interesting data was returned anyway, or because retries have all failed and we want to continue with the same data (and possibly bugs) we'd have seen before the netlink package changes.

- How to verify it

CI tests pass more reliably, with no errors about interrupted calls.

- Description for the changelog

n/a

@robmry robmry marked this pull request as draft August 30, 2024 14:00
@robmry robmry force-pushed the 48400_netlink_eintr branch 3 times, most recently from 66b72ce to 0f90653 Compare September 3, 2024 13:13
@robmry robmry closed this Sep 4, 2024
@robmry robmry reopened this Sep 5, 2024
@robmry robmry force-pushed the 48400_netlink_eintr branch 9 times, most recently from 8e96013 to 7afc608 Compare September 9, 2024 14:46
@robmry robmry changed the title Experimental - test changes to netlink package Retry on EINTR from netlink dump calls Sep 9, 2024
@robmry robmry force-pushed the 48400_netlink_eintr branch from 7afc608 to 9e74044 Compare September 9, 2024 16:12
@robmry robmry self-assigned this Sep 9, 2024
@robmry robmry added area/networking Networking kind/bugfix PR's that fix bugs labels Sep 9, 2024
@robmry robmry added this to the 28.0.0 milestone Sep 9, 2024
@robmry robmry marked this pull request as ready for review September 9, 2024 16:36
"golang.org/x/sys/unix"
)

const maxAttempts = 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment "why 5"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - done ... it's an arbitrary limit, just a small number that's more than one!

@robmry robmry force-pushed the 48400_netlink_eintr branch from 9e74044 to e0aec27 Compare September 9, 2024 16:50
Comment on lines 27 to 29
if err := f(); !errors.Is(err, unix.EINTR) {
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silly question; any other error is intentionally swallowed, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swallowed here in the retry loop, yes ... but also assigned in the caller's closure, and returned from this package from there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doh! I see it now 🙈

@robmry robmry force-pushed the 48400_netlink_eintr branch 2 times, most recently from f83c3d5 to 6b96a9f Compare September 13, 2024 18:17
@robmry robmry force-pushed the 48400_netlink_eintr branch from 6b96a9f to c2f4710 Compare September 13, 2024 18:27
A recent change to the vishvananda/netlink package exposes
NLM_F_DUMP_INTR in some netlink responses as an EINTR (with
no data).

Retry the requests when that happens, up to five times, before
returning the error. The limit of five is arbitrary, on most
systems a single retry will be rare but, there's no guarantee
that a retry will succeed. So, on a very busy or misbehaving
system the error may still be returned. In most cases, this
will lead to failure of the operation being attempted (which
may lead to daemon startup failure, network initialisation
failure etc).

Signed-off-by: Rob Murray <[email protected]>
Spot netlink functions that may return EINTR because
network configuration changed during a state dump, and
point at the nlutil wrappers.

Signed-off-by: Rob Murray <[email protected]>
@robmry robmry force-pushed the 48400_netlink_eintr branch from c2f4710 to edaa0eb Compare September 15, 2024 11:29
Comment on lines +35 to +38
func NewHandle(nlFamilies ...int) (Handle, error) {
nlh, err := netlink.NewHandle(nlFamilies...)
if err != nil {
return Handle{}, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious; should we return a pointer here (and nil on error), similar to how the thing we wrap returns a pointer?

func NewHandle(nlFamilies ...int) (*Handle, error) {
return newHandle(netns.None(), netns.None(), nlFamilies...)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I did initially, but @corhere pointed out ...

My only complaint is the extra pointer indirection. Since nlwrap.Handle is effectively a reference type, a *nlwrap.Handle is laid out in memory identically to a **netlink.Handle. Passing around nlwrap.Handle values by-value would be more efficient
It's a reference type in the same vein as slices -- a value which carries a reference

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, hm, yeah, I was trying to see if we could make it a full drop-in replacement

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@corhere PTAL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants