fix: increase ImagePullProgressTimeout as 10min#9347
fix: increase ImagePullProgressTimeout as 10min#9347andyzhangx wants to merge 1 commit intocontainerd:mainfrom
Conversation
|
Hi @andyzhangx. Thanks for your PR. I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/ok-to-test |
|
@andyzhangx: Cannot trigger testing until a trusted user reviews the PR and leaves an DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
57b1e8c to
4589891
Compare
|
Hi @andyzhangx the commit needs your sign off. Please use 'git commit -s' to sign off. Thanks |
4589891 to
843033e
Compare
|
Coincidentally, I was talking about this with @ruiwen-zhao and @qiutongs. We haven't been able to repro it yet, but we did get a report of this error occurring. I don't really think bumping the timeout alone is a reasonable fix. Looking at the design of In the report that we have, we don't see anything suspicious about the network itself causing requests to take longer to read. However, my (unverified) working theory is that there's some delay outside the @fuweid Does this sound like a reasonable theory to you? |
|
@samuelkarp we have at least two users, they set this value as 10min by using this daemonset, and finally it works: https://github.com/andyzhangx/demo/blob/master/aks/set-image-pull-progress-timeout.yaml#L39, I think this PR would work to some extent, it may be more helpful if we increase other timeout values. |
|
btw, feel free to use following daemonset to reconfigure the default ImagePullProgressTimeout: |
|
@fuweid what does this test failure mean? |
|
@andyzhangx Please sign the commit. refer to https://github.com/microsoft/vscode/wiki/Commit-Signing |
Signed-off-by: Andy Zhang <[email protected]>
843033e to
1ee23e8
Compare
@MartinForReal that's a little complicated, I only need to append |
The timeout will only be triggered then there is NO progress reading from remote registry in the past 1 min, which should be irrelevant to the size of the image. @andyzhangx did you find out why it didn't make progress reading from the registry? Was it because |
I was curious if the purpose of the field was properly documented (as that would be a good improvement if it wasn't), but it looks like it is; containerd/pkg/cri/config/config.go Lines 317 to 320 in 45d7f23 Also looking at the commit that introduced this timeout 00d102d (#6150); That confirms that this option is intended to provide a time-out if no bytes were received within the given time-limit (1 minute).
It's worth checking with them, but from the discussion here, I suspect it worked if the whole image was pulled before the 10 minutes timeout was reached. If that's the case, then it would be tapering over the actual problem. There is no hard limit on image size and, while rare (at least for Linux, Windows images tend to be larger), multi-gigabyte images do exist, and so do "slow networking connections". For those reasons, we cannot make good assumptions how long an image pull can take (even a small image can take long to pull on a poor network connection); pulls that take a long time should be supported, although we can consider having a (user-configurable) hard-limit for pulls. (In addition, pulls may be constrained by other factors, such as auth / tokens expiring, but that's likely out of our hands). The existing "1-minute" timeout is already rather generous (we can expect at least some bytes to arrive within a minute, even on a slow network connection), so the intent here looks to be to protect against exceptional situations. If we're seeing timeouts here, it sounds very plausible that there's a bug at hand, and that should be fixed. All that said; even with that bug fixed, I think it would be a good thing to
|
Yep, that's exactly why I'm suspecting more of a bug with the accounting logic than the connection truly drops to 0 B/s for a minute. |
|
@samuelkarp @thaJeztah the 1-min is from kubelet setting for dockershim. Sorry for late reply. I created flakey case for tracking something like that #8024. But I am still rethinking which part went wrong. The ticker might be fired when the active request just comes, like #8024. And then the tracker saw one active request but no bytes increased. So it fired the timeout. It's possible and I did see it in the CI because the CI timeout is short than 1-min. Might I should refresh the lastSeenTimestamp when we have new active request. But I am still worry about the IO pressure created umount action or unpacking huge layer which triggered by WriteBack thread (ext4 has default commit=5). Let me rethink it and update it later. Thanks |
Signed-off-by: Wei Fu <[email protected]>
Thanks for that added info. Yes, it would be good to document that. "magic" values are always tricky if they're not "trivial"; it's easy to forget where they came from (and can easily lead to wrong conclusions)
Ohman, no worries; no need to apologize! Thanks! |
Signed-off-by: Wei Fu <[email protected]>
Signed-off-by: Wei Fu <[email protected]>
|
closed by #9369 |
Signed-off-by: Wei Fu <[email protected]>
Signed-off-by: Wei Fu <[email protected]> Signed-off-by: Vivek Radhakrishnan <[email protected]>
Signed-off-by: Wei Fu <[email protected]>
#6150 defines ImagePullProgressTimeout as 1min, while this default timeout is sometimes too small when 1) pulling image is slow 2) big image layer extraction on the local disk costing more time 3) pulling big image, this PR tries to increase ImagePullProgressTimeout as 10min