Skip to content

ensure we cancel child context when reading grpc#4629

Merged
mxpv merged 1 commit intocontainerd:masterfrom
deitch:cancel-read-context
Oct 14, 2020
Merged

ensure we cancel child context when reading grpc#4629
mxpv merged 1 commit intocontainerd:masterfrom
deitch:cancel-read-context

Conversation

@deitch
Copy link
Copy Markdown
Contributor

@deitch deitch commented Oct 14, 2020

Fixes #4617

Short summary: we go down the following chain when we try to read data from containerd via grpc:

  1. NewContentStore returns a content.Store implementation, specifically, proxyContentStore
  2. When attempting to get a reader via proxyContentStore.ReaderAt returns a content.ReaderAt implementation, specifically remoteReaderAt
  3. When remoteReaderAt attempts to ReadAt, it calls api/services/content/v1.ContentClient.Read() here
  4. The protobuf-generated go code for v1.ContentClient.Read() creates a grpc.ClientConn.NewStream

The comments on NewStream explicitly state:

ctx is used for the lifetime of the stream

Yet we have the single context for the entire life of the connection. With a short content blob, it doesn't matter; with a larger one, you have leakages of hundreds, thousands or tens of thousands of goroutines (and whatever other resources, like memory) are attendant to them.

This creates a derived child context for each ReadAt(), and therefore each NewStream(), and then cancels the child context only as soon as the ReadAt() is done.

See the attached issue #4617 for much more detail. Short and simple test:

  1. Run ctr content get for some large blob
  2. While halfway through, do kill -USR1 <ctr_pid> or, if you don't mind killing it entirely, kill -ABRT <ctr_pid>
  3. See thousands of goroutines in select state with grpc.newClientStream
  4. Rebuild with this patch
  5. Repeat
  6. See exactly one goroutine in select state with grpc.newClientStream - it is fixed

As discussed with @dmcgowan

@k8s-ci-robot
Copy link
Copy Markdown

Hi @deitch. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Oct 14, 2020

Build succeeded.

@AkihiroSuda
Copy link
Copy Markdown
Member

/ok-to-test

@deitch
Copy link
Copy Markdown
Contributor Author

deitch commented Oct 14, 2020

Thanks @AkihiroSuda 😄

Copy link
Copy Markdown
Member

@mxpv mxpv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@cpuguy83 cpuguy83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first thrown off by:

  1. Call RecvMsg until a non-nil error is returned. A protobuf-generated
    client-streaming RPC, for instance, might use the helper function
    CloseAndRecv (note that CloseSend does not Recv, therefore is not
    guaranteed to release all resources).

But the loop doesn't necessarily read until an error since it's based on len(p) > 0.
So, LGTM.

Copy link
Copy Markdown
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mxpv mxpv merged commit 22aea1e into containerd:master Oct 14, 2020
@deitch deitch deleted the cancel-read-context branch October 15, 2020 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

zombie goroutines in client due to missing ClientStreams

6 participants