Skip to content

After retries, use netlink data even if the dump is still interrupted#48598

Merged
thaJeztah merged 3 commits intomoby:masterfrom
robmry:netlink_dump_interrupted
Oct 15, 2024
Merged

After retries, use netlink data even if the dump is still interrupted#48598
thaJeztah merged 3 commits intomoby:masterfrom
robmry:netlink_dump_interrupted

Conversation

@robmry
Copy link
Contributor

@robmry robmry commented Oct 7, 2024

- What I did

Before version 1.2.1, the vishvananda/netlink package ignored NLM_F_DUMP_INTR in netlink responses. In 1.2.1, it started to return EINTR and no data in that case.

#48407 wrapped affected netlink calls in the nlwrap package. So, there would be up-to 5 attempts to make a request, get EINTR, and retry. After 5 attempts, the EINTR was returned and the caller would handle the failure.

Now, the netlink package returns data along with netlink.ErrDumpInterrupted in this case.

So, after 5 attempts, the nlwrap functions now discard the error and return the last set of data they got - which may be incomplete or inconsistent. But, this matches the behaviour of vishvananda/netlink prior to version 1.2.1, in which the NLM_F_DUMP_INTR flag was ignored.

If this ever happens, a stack trace will be logged.

- How I did it

  • Pick up the visvananda/netlink package at its current-latest commit, which:
    • fixes timeouts, and
    • returns data along with netlink.ErrDumpInterrupted when it encounters an NLM_F_DUMP_INTR error (rather than discarding results and returning EINTR).
    • Full diff: vishvananda/netlink@v1.3.0...084abd9
  • Fix godoc comments in the nlwrap package.
  • In nlwrap, after maxRetries, discard ErrDumpInterrupted, log a stack trace, and return data.

- How to verify it

Not tested directly, but after temporarily changing the nlh.AddrList wrapper to:

--- a/internal/nlwrap/nlwrap_linux.go
+++ b/internal/nlwrap/nlwrap_linux.go
@@ -82,6 +82,7 @@ func discardErrDumpInterrupted(err error) error {
 func (nlh Handle) AddrList(link netlink.Link, family int) (addrs []netlink.Addr, err error) {
        retryOnIntr(func() error {
                addrs, err = nlh.Handle.AddrList(link, family) //nolint:forbidigo
+               err = netlink.ErrDumpInterrupted
                return err
        })
        return addrs, discardErrDumpInterrupted(err)
 }

... checked that daemon startup continued normally, with log messages like this:

INFO[2024-10-07T13:00:44.099150679Z] netlink call interrupted after 5 attempts
WARN[2024-10-07T13:00:44.099197470Z] discarding ErrDumpInterrupted: results may be incomplete or inconsistent
github.com/docker/docker/internal/nlwrap.discardErrDumpInterrupted
	/go/src/github.com/docker/docker/internal/nlwrap/nlwrap_linux.go:75
github.com/docker/docker/internal/nlwrap.Handle.AddrList
	/go/src/github.com/docker/docker/internal/nlwrap/nlwrap_linux.go:88
github.com/docker/docker/daemon.ifaceAddrs.func1
	/go/src/github.com/docker/docker/daemon/daemon_linux.go:166
github.com/docker/docker/daemon.ifaceAddrs
	/go/src/github.com/docker/docker/daemon/daemon_linux.go:178
github.com/docker/docker/daemon.initBridgeDriver
	/go/src/github.com/docker/docker/daemon/daemon_unix.go:949
github.com/docker/docker/daemon.configureNetworking
	/go/src/github.com/docker/docker/daemon/daemon_unix.go:886
github.com/docker/docker/daemon.(*Daemon).initNetworkController
	/go/src/github.com/docker/docker/daemon/daemon_unix.go:850
github.com/docker/docker/daemon.(*Daemon).restore
	/go/src/github.com/docker/docker/daemon/daemon.go:580
github.com/docker/docker/daemon.NewDaemon
	/go/src/github.com/docker/docker/daemon/daemon.go:1241
main.(*daemonCLI).start
	/go/src/github.com/docker/docker/cmd/dockerd/daemon.go:257
main.runDaemon
	/go/src/github.com/docker/docker/cmd/dockerd/docker_unix.go:13
main.newDaemonCommand.func1
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:49
github.com/spf13/cobra.(*Command).execute
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:985
github.com/spf13/cobra.(*Command).ExecuteC
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:1117
github.com/spf13/cobra.(*Command).Execute
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:1041
github.com/spf13/cobra.(*Command).ExecuteContext
	/go/src/github.com/docker/docker/vendor/github.com/spf13/cobra/command.go:1034
main.main
	/go/src/github.com/docker/docker/cmd/dockerd/docker.go:113
runtime.main
	/usr/local/go/src/runtime/proc.go:271
runtime.goexit
	/usr/local/go/src/runtime/asm_arm64.s:1222

- Description for the changelog

n/a

robmry added 2 commits October 7, 2024 18:32
…93d350

Particularly for:
- Preserve results when NLM_F_DUMP_INTR is set (1018).
- Fix SetSendTimeout/SetReceiveTimeout (1012).

full diff: vishvananda/netlink@v1.3.0...084abd9

Signed-off-by: Rob Murray <[email protected]>
Signed-off-by: Rob Murray <[email protected]>
@robmry robmry added area/networking Networking kind/bugfix PR's that fix bugs labels Oct 7, 2024
@robmry robmry added this to the 28.0.0 milestone Oct 7, 2024
@robmry robmry self-assigned this Oct 7, 2024
@robmry robmry marked this pull request as ready for review October 8, 2024 08:04
// error. Discard the error and return the data. This restores the behaviour of
// the netlink package prior to v1.2.1, in which NLM_F_DUMP_INTR was ignored in
// the netlink response.
log.G(context.TODO()).Warnf("discarding ErrDumpInterrupted: %+v", perrors.WithStack(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if the stack trace really needs to be logged as a Warning. Since this was the old behavior, I think we'd know by now if it was really causing any issue. I'd rather log that with debug level.

Copy link
Contributor Author

@robmry robmry Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we wouldn't know though - "sometimes on my host" issues might have been written off as "can't repro"?

I think it's unlikely we'll see failures after five retries. But, if we do, it's probably something we want to know about. And, without a stack trace, there's no way to work out whether the error relates to any observed misbehaviour.

So, I think it's worth keeping the stack trace at (at-least) info level - at least for 28.x, in case there is an issue in the real world?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine, let's keep it as a Warning then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a tracking issue for updating it after 28.0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so ... either we'll see lots of errors with stack traces or, I think more likely, we won't see any.

If we don't see any, might as well leave it in - in case an environment where it's a problem crops up. If we do see the error happening a lot, it'll need addressing anyway.

github.com/tonistiigi/go-archvariant v1.0.0
github.com/vbatts/tar-split v0.11.5
github.com/vishvananda/netlink v1.3.0
github.com/vishvananda/netlink v1.3.1-0.20240922070040-084abd93d350
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we have any pending PRs still in nettlink?

Otherwise I want to give it a try as well to see if they're willing to tag a new release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing pending. So, yes, it's probably worth asking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment on your last merged PR 🤞

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No response yet ... shall we merge this to get tests running with it, and before it rots - then update the release tag when one's available?

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

one minor suggestion, but no blocker if you want this merged

"errors"

"github.com/containerd/log"
perrors "github.com/pkg/errors"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also swap out one for the other, instead of importing both (looks like it's only using 'errors.Is' which I think is an alias in pkg/errors)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// error. Discard the error and return the data. This restores the behaviour of
// the netlink package prior to v1.2.1, in which NLM_F_DUMP_INTR was ignored in
// the netlink response.
log.G(context.TODO()).Warnf("discarding ErrDumpInterrupted: %+v", perrors.WithStack(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a tracking issue for updating it after 28.0?

Returning possibly inconsistent data avoids retrying indefinitely, and
matches the behaviour of vishvananda/netlink prior to version 1.2.1, in
which the NLM_F_DUMP_INTR flag was ignored.

Signed-off-by: Rob Murray <[email protected]>
@robmry robmry force-pushed the netlink_dump_interrupted branch from 5bd1d1b to a0a0bba Compare October 15, 2024 10:36
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

// The wrapped functions currently return EINTR when NLM_F_DUMP_INTR flagged
// in a netlink response, meaning something changed during the dump so results
// may be incomplete or inconsistent.
// When netlink.ErrDumpInterrupted is returned, the wrapped functions retry up to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Definitely not an issue, but if you put these things in square brackets, then pkg.go.dev will use them as docs-links, i.e.

Suggested change
// When netlink.ErrDumpInterrupted is returned, the wrapped functions retry up to
// When [netlink.ErrDumpInterrupted] is returned, the wrapped functions retry up to

Will make netlink.ErrDumpInterrupted a clickable link to the pkg.go.dev documentation of netlink (which is pretty neat).

(Custom reference links are also supported)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

networking: investigate EINTR regression after updating github.com/vishvananda/netlink to v1.3.0

3 participants