After retries, use netlink data even if the dump is still interrupted#48598
After retries, use netlink data even if the dump is still interrupted#48598thaJeztah merged 3 commits intomoby:masterfrom
Conversation
…93d350 Particularly for: - Preserve results when NLM_F_DUMP_INTR is set (1018). - Fix SetSendTimeout/SetReceiveTimeout (1012). full diff: vishvananda/netlink@v1.3.0...084abd9 Signed-off-by: Rob Murray <[email protected]>
Signed-off-by: Rob Murray <[email protected]>
internal/nlwrap/nlwrap_linux.go
Outdated
| // error. Discard the error and return the data. This restores the behaviour of | ||
| // the netlink package prior to v1.2.1, in which NLM_F_DUMP_INTR was ignored in | ||
| // the netlink response. | ||
| log.G(context.TODO()).Warnf("discarding ErrDumpInterrupted: %+v", perrors.WithStack(err)) |
There was a problem hiding this comment.
Wondering if the stack trace really needs to be logged as a Warning. Since this was the old behavior, I think we'd know by now if it was really causing any issue. I'd rather log that with debug level.
There was a problem hiding this comment.
Maybe we wouldn't know though - "sometimes on my host" issues might have been written off as "can't repro"?
I think it's unlikely we'll see failures after five retries. But, if we do, it's probably something we want to know about. And, without a stack trace, there's no way to work out whether the error relates to any observed misbehaviour.
So, I think it's worth keeping the stack trace at (at-least) info level - at least for 28.x, in case there is an issue in the real world?
There was a problem hiding this comment.
Fine, let's keep it as a Warning then.
There was a problem hiding this comment.
Do we need a tracking issue for updating it after 28.0?
There was a problem hiding this comment.
I don't think so ... either we'll see lots of errors with stack traces or, I think more likely, we won't see any.
If we don't see any, might as well leave it in - in case an environment where it's a problem crops up. If we do see the error happening a lot, it'll need addressing anyway.
| github.com/tonistiigi/go-archvariant v1.0.0 | ||
| github.com/vbatts/tar-split v0.11.5 | ||
| github.com/vishvananda/netlink v1.3.0 | ||
| github.com/vishvananda/netlink v1.3.1-0.20240922070040-084abd93d350 |
There was a problem hiding this comment.
Did we have any pending PRs still in nettlink?
Otherwise I want to give it a try as well to see if they're willing to tag a new release.
There was a problem hiding this comment.
Nothing pending. So, yes, it's probably worth asking.
There was a problem hiding this comment.
I left a comment on your last merged PR 🤞
There was a problem hiding this comment.
No response yet ... shall we merge this to get tests running with it, and before it rots - then update the release tag when one's available?
thaJeztah
left a comment
There was a problem hiding this comment.
LGTM
one minor suggestion, but no blocker if you want this merged
internal/nlwrap/nlwrap_linux.go
Outdated
| "errors" | ||
|
|
||
| "github.com/containerd/log" | ||
| perrors "github.com/pkg/errors" |
There was a problem hiding this comment.
We could also swap out one for the other, instead of importing both (looks like it's only using 'errors.Is' which I think is an alias in pkg/errors)
internal/nlwrap/nlwrap_linux.go
Outdated
| // error. Discard the error and return the data. This restores the behaviour of | ||
| // the netlink package prior to v1.2.1, in which NLM_F_DUMP_INTR was ignored in | ||
| // the netlink response. | ||
| log.G(context.TODO()).Warnf("discarding ErrDumpInterrupted: %+v", perrors.WithStack(err)) |
There was a problem hiding this comment.
Do we need a tracking issue for updating it after 28.0?
Returning possibly inconsistent data avoids retrying indefinitely, and matches the behaviour of vishvananda/netlink prior to version 1.2.1, in which the NLM_F_DUMP_INTR flag was ignored. Signed-off-by: Rob Murray <[email protected]>
5bd1d1b to
a0a0bba
Compare
| // The wrapped functions currently return EINTR when NLM_F_DUMP_INTR flagged | ||
| // in a netlink response, meaning something changed during the dump so results | ||
| // may be incomplete or inconsistent. | ||
| // When netlink.ErrDumpInterrupted is returned, the wrapped functions retry up to |
There was a problem hiding this comment.
Oh! Definitely not an issue, but if you put these things in square brackets, then pkg.go.dev will use them as docs-links, i.e.
| // When netlink.ErrDumpInterrupted is returned, the wrapped functions retry up to | |
| // When [netlink.ErrDumpInterrupted] is returned, the wrapped functions retry up to |
Will make netlink.ErrDumpInterrupted a clickable link to the pkg.go.dev documentation of netlink (which is pretty neat).
(Custom reference links are also supported)
- What I did
EINTRregression after updating github.com/vishvananda/netlink to v1.3.0 #48400Before version 1.2.1, the
vishvananda/netlinkpackage ignoredNLM_F_DUMP_INTRin netlink responses. In 1.2.1, it started to returnEINTRand no data in that case.#48407 wrapped affected netlink calls in the
nlwrappackage. So, there would be up-to 5 attempts to make a request, getEINTR, and retry. After 5 attempts, theEINTRwas returned and the caller would handle the failure.Now, the netlink package returns data along with
netlink.ErrDumpInterruptedin this case.So, after 5 attempts, the
nlwrapfunctions now discard the error and return the last set of data they got - which may be incomplete or inconsistent. But, this matches the behaviour of vishvananda/netlink prior to version 1.2.1, in which the NLM_F_DUMP_INTR flag was ignored.If this ever happens, a stack trace will be logged.
- How I did it
visvananda/netlinkpackage at its current-latest commit, which:netlink.ErrDumpInterruptedwhen it encounters anNLM_F_DUMP_INTRerror (rather than discarding results and returningEINTR).nlwrappackage.nlwrap, after maxRetries, discard ErrDumpInterrupted, log a stack trace, and return data.- How to verify it
Not tested directly, but after temporarily changing the
nlh.AddrListwrapper to:... checked that daemon startup continued normally, with log messages like this:
- Description for the changelog