
fix: increase ImagePullProgressTimeout as 10min #9347

Closed
andyzhangx wants to merge 1 commit into containerd:main from andyzhangx:increase-ImagePullProgressTimeout

Conversation

@andyzhangx commented Nov 8, 2023

#6150 defines ImagePullProgressTimeout as 1min, but this default timeout is sometimes too small when 1) the image pull is slow, 2) extracting a big image layer to the local disk takes more time, or 3) the image is big. This PR increases ImagePullProgressTimeout to 10min.
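
For illustration, a minimal sketch of what the change amounts to. The names here are hypothetical; the real default lives in pkg/cri/config and is stored as a duration string (per #6150):

  package config

  import "time"

  // defaultImagePullProgressTimeout is the value this PR proposes; the
  // previous default, introduced in #6150, was time.Minute.
  const defaultImagePullProgressTimeout = 10 * time.Minute

  // imagePullProgressTimeoutString mirrors how the CRI config stores the
  // timeout as a duration string (e.g. "10m0s").
  func imagePullProgressTimeoutString() string {
      return defaultImagePullProgressTimeout.String()
  }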

@k8s-ci-robot

Hi @andyzhangx. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andyzhangx
Author

/ok-to-test

@k8s-ci-robot

@andyzhangx: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.


In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andyzhangx force-pushed the increase-ImagePullProgressTimeout branch from 57b1e8c to 4589891 on November 8, 2023 11:37
@fuweid
Member

fuweid commented Nov 9, 2023

Hi @andyzhangx, the commit needs your sign-off. Please use 'git commit -s' to sign off. Thanks

Comment thread on pkg/cri/config/config_unix.go (outdated)
@fuweid added the cherry-pick/1.7.x label (change to be cherry-picked to the release/1.7 branch) on Nov 9, 2023
@andyzhangx force-pushed the increase-ImagePullProgressTimeout branch from 4589891 to 843033e on November 9, 2023 03:28
@andyzhangx changed the title from "fix: increase ImagePullProgressTimeout as 5min" to "fix: increase ImagePullProgressTimeout as 10min" on Nov 9, 2023
@samuelkarp
Member

Coincidentally, I was talking about this with @ruiwen-zhao and @qiutongs. We haven't been able to repro it yet, but we did get a report of this error occurring.

I don't really think bumping the timeout alone is a reasonable fix. Looking at the design of pullProgressReporter and countingReadCloser, the intent is to capture whether the download is making progress; the number of bytes read from the wrapped io.ReadCloser, together with a timer, is the mechanism used.

In the report that we have, we don't see anything suspicious about the network itself causing requests to take longer to read. However, my (unverified) working theory is that there's some delay outside the countingReadCloser such that countingReadCloser.Read isn't called within the window that pullProgressReporter's timer is running, and that is what produces the timeout we see. In other words: if countingReadCloser.Read isn't called, r.reqReporter.incByteRead won't be called, and we'll exceed reporter.timeout.
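
For illustration, a minimal self-contained sketch of that design; the type and method names follow the discussion, but the bodies are invented rather than containerd's actual code. If nothing calls Read, the byte count stalls and the watcher fires even though the connection may be healthy:

  package main

  import (
      "fmt"
      "io"
      "strings"
      "sync/atomic"
      "time"
  )

  // countingReadCloser wraps an io.ReadCloser and counts the bytes read;
  // the count only advances when Read is actually called.
  type countingReadCloser struct {
      rc    io.ReadCloser
      bytes atomic.Int64
  }

  func (c *countingReadCloser) Read(p []byte) (int, error) {
      n, err := c.rc.Read(p)
      c.bytes.Add(int64(n))
      return n, err
  }

  func (c *countingReadCloser) Close() error { return c.rc.Close() }

  // watchProgress cancels the pull when the byte count stops advancing for
  // a full interval: if nothing calls Read, the count stalls and the
  // timeout fires regardless of the network's health.
  func watchProgress(c *countingReadCloser, interval time.Duration, cancel func()) {
      ticker := time.NewTicker(interval)
      defer ticker.Stop()
      var last int64
      for range ticker.C {
          cur := c.bytes.Load()
          if cur == last {
              cancel()
              return
          }
          last = cur
      }
  }

  func main() {
      crc := &countingReadCloser{rc: io.NopCloser(strings.NewReader("layer data"))}
      go watchProgress(crc, time.Minute, func() { fmt.Println("pull cancelled: no progress") })
      _, _ = io.Copy(io.Discard, crc)
  }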

@fuweid Does this sound like a reasonable theory to you?

@andyzhangx
Author

@samuelkarp we have at least two users who set this value to 10min using this daemonset, and it works for them: https://github.com/andyzhangx/demo/blob/master/aks/set-image-pull-progress-timeout.yaml#L39. I think this PR would help to some extent; it may be even more helpful if we also increase other timeout values.

@andyzhangx
Author

btw, feel free to use the following daemonset to reconfigure the default ImagePullProgressTimeout:

kubectl apply -f https://raw.githubusercontent.com/andyzhangx/demo/master/aks/set-image-pull-progress-timeout.yaml 

@andyzhangx
Author

@fuweid what does this test failure mean?

https://github.com/containerd/containerd/actions/runs/6807043328/job/18509239645?pr=9347

  + /home/runner/work/_actions/containerd/project-checks/v1.1.0/script/validate/dco
  /home/runner/work/containerd/containerd/bin/git-validation
  time="2023-11-09T03:28:59Z" level=info msg="using commit range: 5149050d6bca1ff28c914e50348b3c479ff9b597..HEAD"
   * c8a84671 "Merge 843033e1c0b578d9ec4a21cb916d00e62cb14adf into c2b39b139e28a7519db614b3ef91767ef8ef0e75" ... PASS
    - PASS - commit does not have any whitespace errors
    - PASS - merge commits do not require DCO
    - PASS - merge commits do not require length check
   * 843033e1 "fix: increase ImagePullProgressTimeout as 10min" ... FAIL
    - PASS - commit does not have any whitespace errors
    - FAIL - does not have a valid DCO
    - PASS - commit subject is 72 characters or less! *yay*
   * c2b39b13 "Merge pull request #9349 from fuweid/deflaky-TestIssue9103" ... PASS
    - PASS - commit does not have any whitespace errors
    - PASS - merge commits do not require DCO
    - PASS - merge commits do not require length check
   * dac056fe "integration: deflake TestIssue9103" ... PASS
    - PASS - commit does not have any whitespace errors
    - PASS - has a valid DCO
    - PASS - commit subject is 72 characters or less! *yay*

@MartinForReal

@andyzhangx Please sign the commit. refer to https://github.com/microsoft/vscode/wiki/Commit-Signing

@andyzhangx force-pushed the increase-ImagePullProgressTimeout branch from 843033e to 1ee23e8 on November 9, 2023 10:34
@andyzhangx
Author

@andyzhangx Please sign the commit. refer to https://github.com/microsoft/vscode/wiki/Commit-Signing

@MartinForReal that's a little complicated; I only need to append Signed-off-by: Andy Zhang <[email protected]> to the commit message. Thanks anyway.
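
For reference, the conventional way to add the sign-off to the existing commit and update the PR (standard git flags; the branch name is the one from this PR):

  git commit --amend --signoff --no-edit
  git push --force-with-lease origin increase-ImagePullProgressTimeout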

@ruiwen-zhao
Member

slow, 2) extracting a big image layer to the local disk takes more time, or 3) the image is big. This PR increases ImagePullProgressTimeout to 10min.

The timeout will only be triggered when there is NO progress reading from the remote registry in the past 1 min, which should be irrelevant to the size of the image.

@andyzhangx did you find out why it didn't make progress reading from the registry? Was it because countingReadCloser.Read isn't called within the timer that pullProgressReporter is running, as @samuelkarp mentioned above?

@thaJeztah
Member

I don't really think bumping the timeout alone is a reasonable fix. Looking at the design of pullProgressReporter and countingReadCloser, the intent is to capture whether the download is making progress; the number of bytes read from the wrapped io.ReadCloser, together with a timer, is the mechanism used.

I was curious if the purpose of the field was properly documented (as that would be a good improvement if it wasn't), but it looks like it is:

// ImagePullProgressTimeout is the maximum duration that there is no
// image data read from image registry in the open connection. It will
// be reset whenever a new byte has been read. If timeout, the image
// pulling will be cancelled. A zero value means there is no timeout.

Also, looking at the commit that introduced this timeout, 00d102d (#6150):

Kubelet sends the PullImage request without timeout, because the image size
is unknown and timeout is hard to define. The pulling request might run
into 0B/s speed, if containerd can't receive any packet in that connection.
For this case, the containerd should cancel the PullImage request.

That confirms that this option is intended to provide a time-out if no bytes were received within the given time-limit (1 minute).

we have at least two users who set this value to 10min using this daemonset, and it works for them

It's worth checking with them, but from the discussion here, I suspect it worked because the whole image was pulled before the 10-minute timeout was reached. If that's the case, then it would be papering over the actual problem.

There is no hard limit on image size and, while rare (at least for Linux; Windows images tend to be larger), multi-gigabyte images do exist, and so do slow network connections. For those reasons, we cannot make good assumptions about how long an image pull can take (even a small image can take long to pull on a poor network connection); pulls that take a long time should be supported, although we can consider having a (user-configurable) hard limit for pulls. (In addition, pulls may be constrained by other factors, such as auth / tokens expiring, but that's likely out of our hands.)

The existing "1-minute" timeout is already rather generous (we can expect at least some bytes to arrive within a minute, even on a slow network connection), so the intent here looks to be to protect against exceptional situations. If we're seeing timeouts here, it sounds very plausible that there's a bug at hand, and that should be fixed.

All that said; even with that bug fixed, I think it would be a good thing to

  • document where the 1 minute timeout came from; it looks to be an arbitrary value ("1 minute should be enough"), which is fine, but it helps to document this. Given that both the Windows and Unix variants use the same value, perhaps define a const for this, and add documentation to that const (if it's only used internally, we don't have to export the const yet).
  • check what error is returned on timeout; I haven't looked closely at the implementation, but if we're cancelling the context, we could use the new context.WithCancelCause() to provide more information (i.e., "no bytes received within <configured timeout>"); see the sketch below.
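
A minimal sketch of both suggestions; the const name is hypothetical, while context.WithCancelCause and context.Cause are standard library APIs (Go 1.20+):

  package main

  import (
      "context"
      "fmt"
      "time"
  )

  // defaultImagePullProgressTimeout documents where the value comes from
  // instead of leaving a bare magic number; it stays unexported while it
  // is only used internally.
  const defaultImagePullProgressTimeout = time.Minute

  func main() {
      ctx, cancel := context.WithCancelCause(context.Background())

      // On progress-timer expiry, cancel with an explicit cause instead
      // of a bare context.Canceled.
      cancel(fmt.Errorf("no bytes received within %v", defaultImagePullProgressTimeout))

      <-ctx.Done()
      fmt.Println(ctx.Err())          // context canceled
      fmt.Println(context.Cause(ctx)) // no bytes received within 1m0s
  }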

@samuelkarp
Member

The existing "1-minute" timeout is already rather generous (we can expect at least some bytes to arrive within a minute, even on a slow network connection), so the intent here looks to be to protect against exceptional situations. If we're seeing timeouts here, it sounds very plausible that there's a bug at hand, and that should be fixed.

Yep, that's exactly why I suspect a bug in the accounting logic rather than the connection truly dropping to 0 B/s for a minute.

@fuweid
Member

fuweid commented Nov 14, 2023

@samuelkarp @thaJeztah the 1-min value comes from the kubelet setting for dockershim. Sorry for the late reply. I created a flaky-case tracking issue for something like that: #8024. But I am still rethinking which part went wrong. The ticker might fire just as a new active request arrives (as in #8024): the tracker then sees one active request but no increase in bytes, so it fires the timeout. It's possible, and I did see it in CI, because the CI timeout is shorter than 1 min. Maybe I should refresh the lastSeenTimestamp when we have a new active request. But I am still worried about the IO pressure created by umount actions or by unpacking a huge layer, which is triggered by the writeback thread (ext4 has default commit=5).

Let me rethink it and update it later. Thanks
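
For illustration, a compilable sketch of the race described above and the proposed lastSeenTimestamp refresh; the field and method names are invented, and the real logic lives in containerd's pull progress reporter:

  package reporter

  import (
      "sync"
      "time"
  )

  // reporter models the check that can misfire: the ticker fires right
  // after a request becomes active but before any bytes arrive, so the
  // check sees an active request with zero new bytes and times out.
  type reporter struct {
      mu                sync.Mutex
      activeRequests    int
      seenBytes         int64
      lastSeenBytes     int64
      lastSeenTimestamp time.Time
  }

  // requestStarted applies the proposed fix: refreshing lastSeenTimestamp
  // when a request becomes active gives it a full interval before it can
  // be timed out.
  func (r *reporter) requestStarted() {
      r.mu.Lock()
      defer r.mu.Unlock()
      r.activeRequests++
      r.lastSeenTimestamp = time.Now()
  }

  // timedOut is what the ticker callback would evaluate.
  func (r *reporter) timedOut(timeout time.Duration) bool {
      r.mu.Lock()
      defer r.mu.Unlock()
      if r.activeRequests == 0 {
          return false
      }
      if r.seenBytes > r.lastSeenBytes {
          r.lastSeenBytes = r.seenBytes
          r.lastSeenTimestamp = time.Now()
          return false
      }
      return time.Since(r.lastSeenTimestamp) >= timeout
  }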

fuweid added a commit to fuweid/containerd that referenced this pull request Nov 15, 2023
@thaJeztah
Member

the 1-min value comes from the kubelet setting for dockershim.

Thanks for that added info. Yes, it would be good to document that. "magic" values are always tricky if they're not "trivial"; it's easy to forget where they came from (and can easily lead to wrong conclusions)

Sorry for the late reply.

Oh man, no worries; no need to apologize! Thanks!

fuweid added a commit to fuweid/containerd that referenced this pull request Nov 17, 2023
fuweid added a commit to fuweid/containerd that referenced this pull request Nov 18, 2023
@fuweid removed the cherry-pick/1.7.x label (change to be cherry-picked to the release/1.7 branch) on Nov 21, 2023
@fuweid
Member

fuweid commented Nov 21, 2023

closed by #9369

@fuweid closed this on Nov 21, 2023
ruiwen-zhao pushed a commit to ruiwen-zhao/containerd that referenced this pull request Nov 21, 2023
smallfoot47 pushed a commit to smallfoot47/containerd that referenced this pull request Dec 11, 2023
Signed-off-by: Wei Fu <[email protected]>
Signed-off-by: Vivek Radhakrishnan <[email protected]>
juliusl pushed a commit to juliusl/containerd that referenced this pull request Jan 26, 2024