grpc.WithBlock() in containerd.New() swallows underlying issue when attempting to connect #2576

@dweomer

Description

@crosbymichael / boss is pretty awesome and has spurred my interest in developing something on top of containerd. However, right out of the gate, working with ctr and even containerd.New() is currently block-tastic and uninformative as to why.

Steps to reproduce the issue:

  1. Invoke containerd.New() via ctr (or a custom client) against a socket that doesn't exist or that you don't have permission to access. This causes a hang followed by a "context deadline exceeded" error.

Describe the results you received:

78ee9d07f4cf:~/go/src/github.com/containerd/containerd# time su-exec nobody ctr namespace ls
ctr: failed to dial "/run/containerd/containerd.sock": context deadline exceeded

real    0m9.981s
user    0m0.020s
sys     0m0.000s

Describe the results you expected:
Something like:

78ee9d07f4cf:~/go/src/github.com/containerd/containerd# time su-exec nobody ctr namespace ls
ctr: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: permission denied": unavailable

real    0m0.013s
user    0m0.000s
sys     0m0.010s

(This was achieved by commenting out the grpc.WithBlock() dial option in containerd.New().)

Output of containerd --version:

containerd github.com/containerd/containerd v1.2.0-beta.1-2-g37a6a91b 37a6a91bdf9da67a6266a0725c2f9a03c61110c4

I've looked into the grpc code that is causing the issue (entrypoint here: https://github.com/grpc/grpc-go/blob/v1.12.0/clientconn.go#L607-L618). Based on the discussion on #989 (comment) by @estesp and @stevvooe, combined with the back-and-forth re: grpc.WithBlock in commits between @estesp and @dmcgowan, followed up by @crosbymichael adding the default 10s timeout, I feel like the grpc blocking connect code is not fit for purpose when working with unix sockets (as containerd exclusively does).

My proposal to fix this situation is guided by this assertion: when attempting to connect to a containerd socket that does not exist or that the current user does not have read/write permissions for, clients should immediately error out (unless they opt to wait).

I've done some searching, but it isn't immediately obvious to me why the blocking behavior promised by grpc.WithBlock() (combined with context.WithTimeout()) is even needed. The only thing I can think of is race conditions when starting up containerd's most popular client, aka dockerd (and kubelet?), because containerd as a server is spawned by its client. If that is actually the case, then my proposed fix would invert current client expectations, and these two very popular downstream projects would have to change to avoid breakage when adopting whatever version of containerd such a fix lands in.

Alternatively, if such an inversion of the existing expectations of major client runtimes is not acceptable, maybe we can augment containerd.New via the ClientOpt vararg (or deprecate it in favor of a new implementation) to perform some simple checks prior to connecting via grpc?

I am willing to work on this, but am looking for guidance as to team preferences as well as other considerations that I may be oblivious to.
