
fix: increase ImagePullProgressTimeout as 10min #9347

Closed
andyzhangx wants to merge 1 commit into containerd:main from andyzhangx:increase-ImagePullProgressTimeout

Conversation

@andyzhangx commented Nov 8, 2023

#6150 defines ImagePullProgressTimeout as 1min, but this default timeout is sometimes too small when 1) the image pull is slow, 2) extracting a big image layer to the local disk takes more time, or 3) the image is big. This PR increases ImagePullProgressTimeout to 10min.
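
For illustration, a minimal sketch of what the change amounts to. The names here are hypothetical; the real default lives in pkg/cri/config and is stored as a duration string (per #6150):

  package config

  import "time"

  // defaultImagePullProgressTimeout is the value this PR proposes; the
  // previous default, introduced in #6150, was time.Minute.
  const defaultImagePullProgressTimeout = 10 * time.Minute

  // imagePullProgressTimeoutString mirrors how the CRI config stores the
  // timeout as a duration string (e.g. "10m0s").
  func imagePullProgressTimeoutString() string {
      return defaultImagePullProgressTimeout.String()
  }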

@k8s-ci-robot

Hi @andyzhangx. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andyzhangx
Author

/ok-to-test

@k8s-ci-robot

@andyzhangx: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.


In response to this:

/ok-to-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andyzhangx force-pushed the increase-ImagePullProgressTimeout branch from 57b1e8c to 4589891 on November 8, 2023 11:37
@fuweid
Member

fuweid commented Nov 9, 2023

Hi @andyzhangx, the commit needs your sign-off. Please use 'git commit -s' to sign off. Thanks

Comment thread on pkg/cri/config/config_unix.go (outdated)
@fuweid added the cherry-pick/1.7.x label (change to be cherry-picked to the release/1.7 branch) on Nov 9, 2023
@andyzhangx force-pushed the increase-ImagePullProgressTimeout branch from 4589891 to 843033e on November 9, 2023 03:28
@andyzhangx changed the title from "fix: increase ImagePullProgressTimeout as 5min" to "fix: increase ImagePullProgressTimeout as 10min" on Nov 9, 2023
@samuelkarp
Member

Coincidentally, I was talking about this with @ruiwen-zhao and @qiutongs. We haven't been able to repro it yet, but we did get a report of this error occurring.

I don't really think bumping the timeout alone is a reasonable fix. Looking at the design of pullProgressReporter and countingReadCloser, the intent is to capture whether the download is making progress; the number of bytes read from the wrapped io.ReadCloser, together with a timer, is the mechanism used.

In the report that we have, we don't see anything suspicious about the network itself causing requests to take longer to read. However, my (unverified) working theory is that there's some delay outside the countingReadCloser such that countingReadCloser.Read isn't called within the window that pullProgressReporter's timer is running, and that is what produces the timeout we see. In other words: if countingReadCloser.Read isn't called, r.reqReporter.incByteRead won't be called, and we'll exceed reporter.timeout.
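
For illustration, a minimal self-contained sketch of that design; the type and method names follow the discussion, but the bodies are invented rather than containerd's actual code. If nothing calls Read, the byte count stalls and the watcher fires even though the connection may be healthy:

  package main

  import (
      "fmt"
      "io"
      "strings"
      "sync/atomic"
      "time"
  )

  // countingReadCloser wraps an io.ReadCloser and counts the bytes read;
  // the count only advances when Read is actually called.
  type countingReadCloser struct {
      rc    io.ReadCloser
      bytes atomic.Int64
  }

  func (c *countingReadCloser) Read(p []byte) (int, error) {
      n, err := c.rc.Read(p)
      c.bytes.Add(int64(n))
      return n, err
  }

  func (c *countingReadCloser) Close() error { return c.rc.Close() }

  // watchProgress cancels the pull when the byte count stops advancing for
  // a full interval: if nothing calls Read, the count stalls and the
  // timeout fires regardless of the network's health.
  func watchProgress(c *countingReadCloser, interval time.Duration, cancel func()) {
      ticker := time.NewTicker(interval)
      defer ticker.Stop()
      var last int64
      for range ticker.C {
          cur := c.bytes.Load()
          if cur == last {
              cancel()
              return
          }
          last = cur
      }
  }

  func main() {
      crc := &countingReadCloser{rc: io.NopCloser(strings.NewReader("layer data"))}
      go watchProgress(crc, time.Minute, func() { fmt.Println("pull cancelled: no progress") })
      _, _ = io.Copy(io.Discard, crc)
  }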

@fuweid Does this sound like a reasonable theory to you?

@andyzhangx
Author

@samuelkarp we have at least two users who set this value to 10min using this daemonset, and it works for them: https://github.com/andyzhangx/demo/blob/master/aks/set-image-pull-progress-timeout.yaml#L39. I think this PR would help to some extent; it may be even more helpful if we also increase other timeout values.

@andyzhangx
Author

btw, feel free to use the following daemonset to reconfigure the default ImagePullProgressTimeout:

kubectl apply -f https://raw.githubusercontent.com/andyzhangx/demo/master/aks/set-image-pull-progress-timeout.yaml 

@andyzhangx
Author

@fuweid what does this test failure mean?

https://github.com/containerd/containerd/actions/runs/6807043328/job/18509239645?pr=9347

  + /home/runner/work/_actions/containerd/project-checks/v1.1.0/script/validate/dco
  /home/runner/work/containerd/containerd/bin/git-validation
  time="2023-11-09T03:28:59Z" level=info msg="using commit range: 5149050d6bca1ff28c914e50348b3c479ff9b597..HEAD"
   * c8a84671 "Merge 843033e1c0b578d9ec4a21cb916d00e62cb14adf into c2b39b139e28a7519db614b3ef91767ef8ef0e75" ... PASS
    - PASS - commit does not have any whitespace errors
    - PASS - merge commits do not require DCO
    - PASS - merge commits do not require length check
   * 843033e1 "fix: increase ImagePullProgressTimeout as 10min" ... FAIL
    - PASS - commit does not have any whitespace errors
    - FAIL - does not have a valid DCO
    - PASS - commit subject is 72 characters or less! *yay*
   * c2b39b13 "Merge pull request #9349 from fuweid/deflaky-TestIssue9103" ... PASS
    - PASS - commit does not have any whitespace errors
    - PASS - merge commits do not require DCO
    - PASS - merge commits do not require length check
   * dac056fe "integration: deflake TestIssue9103" ... PASS
    - PASS - commit does not have any whitespace errors
    - PASS - has a valid DCO
    - PASS - commit subject is 72 characters or less! *yay*

@MartinForReal

@andyzhangx Please sign the commit. refer to https://github.com/microsoft/vscode/wiki/Commit-Signing

@andyzhangx force-pushed the increase-ImagePullProgressTimeout branch from 843033e to 1ee23e8 on November 9, 2023 10:34
@andyzhangx
Author

@andyzhangx Please sign the commit. refer to https://github.com/microsoft/vscode/wiki/Commit-Signing

@MartinForReal that's a little complicated; I only need to append Signed-off-by: Andy Zhang <[email protected]> to the commit message. Thanks anyway.
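
For reference, the conventional way to add the sign-off to the existing commit and update the PR (standard git flags; the branch name is the one from this PR):

  git commit --amend --signoff --no-edit
  git push --force-with-lease origin increase-ImagePullProgressTimeout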

@ruiwen-zhao
Member

slow, 2) extracting a big image layer to the local disk takes more time, or 3) the image is big. This PR increases ImagePullProgressTimeout to 10min.

The timeout will only be triggered when there is NO progress reading from the remote registry in the past 1 min, which should be irrelevant to the size of the image.

@andyzhangx did you find out why it didn't make progress reading from the registry? Was it because countingReadCloser.Read isn't called within the timer that pullProgressReporter is running, as @samuelkarp mentioned above?

@thaJeztah
Member

I don't really think bumping the timeout alone is a reasonable fix. Looking at the design of pullProgressReporter and countingReadCloser, the intent is to capture whether the download is making progress; the number of bytes read from the wrapped io.ReadCloser, together with a timer, is the mechanism used.

I was curious if the purpose of the field was properly documented (as that would be a good improvement if it wasn't), but it looks like it is:

// ImagePullProgressTimeout is the maximum duration that there is no
// image data read from image registry in the open connection. It will
// be reset whenever a new byte has been read. If timeout, the image
// pulling will be cancelled. A zero value means there is no timeout.

Also, looking at the commit that introduced this timeout, 00d102d (#6150):

Kubelet sends the PullImage request without timeout, because the image size
is unknown and timeout is hard to define. The pulling request might run
into 0B/s speed, if containerd can't receive any packet in that connection.
For this case, the containerd should cancel the PullImage request.

That confirms that this option is intended to provide a time-out if no bytes were received within the given time-limit (1 minute).

we have at least two users who set this value to 10min using this daemonset, and it works for them

It's worth checking with them, but from the discussion here, I suspect it worked because the whole image was pulled before the 10-minute timeout was reached. If that's the case, then it would be papering over the actual problem.

There is no hard limit on image size and, while rare (at least for Linux; Windows images tend to be larger), multi-gigabyte images do exist, and so do slow network connections. For those reasons, we cannot make good assumptions about how long an image pull can take (even a small image can take long to pull on a poor network connection); pulls that take a long time should be supported, although we can consider having a (user-configurable) hard limit for pulls. (In addition, pulls may be constrained by other factors, such as auth / tokens expiring, but that's likely out of our hands.)

The existing "1-minute" timeout is already rather generous (we can expect at least some bytes to arrive within a minute, even on a slow network connection), so the intent here looks to be to protect against exceptional situations. If we're seeing timeouts here, it sounds very plausible that there's a bug at hand, and that should be fixed.

All that said; even with that bug fixed, I think it would be a good thing to

  • document where the 1 minute timeout came from; it looks to be an arbitrary value ("1 minute should be enough"), which is fine, but it helps to document this. Given that both the Windows and Unix variants use the same value, perhaps define a const for this, and add documentation to that const (if it's only used internally, we don't have to export the const yet).
  • check what error is returned on timeout; I haven't looked closely at the implementation, but if we're cancelling the context, we could use the new context.WithCancelCause() to provide more information (i.e., "no bytes received within <configured timeout>"); see the sketch below.
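
A minimal sketch of both suggestions; the const name is hypothetical, while context.WithCancelCause and context.Cause are standard library APIs (Go 1.20+):

  package main

  import (
      "context"
      "fmt"
      "time"
  )

  // defaultImagePullProgressTimeout documents where the value comes from
  // instead of leaving a bare magic number; it stays unexported while it
  // is only used internally.
  const defaultImagePullProgressTimeout = time.Minute

  func main() {
      ctx, cancel := context.WithCancelCause(context.Background())

      // On progress-timer expiry, cancel with an explicit cause instead
      // of a bare context.Canceled.
      cancel(fmt.Errorf("no bytes received within %v", defaultImagePullProgressTimeout))

      <-ctx.Done()
      fmt.Println(ctx.Err())          // context canceled
      fmt.Println(context.Cause(ctx)) // no bytes received within 1m0s
  }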

@samuelkarp
Member

The existing "1-minute" timeout is already rather generous (we can expect at least some bytes to arrive within a minute, even on a slow network connection), so the intent here looks to be to protect against exceptional situations. If we're seeing timeouts here, it sounds very plausible that there's a bug at hand, and that should be fixed.

Yep, that's exactly why I suspect a bug in the accounting logic rather than the connection truly dropping to 0 B/s for a minute.

@fuweid
Member

fuweid commented Nov 14, 2023

@samuelkarp @thaJeztah the 1-min value comes from the kubelet setting for dockershim. Sorry for the late reply. I created a flaky-case tracking issue for something like that: #8024. But I am still rethinking which part went wrong. The ticker might fire just as a new active request arrives (as in #8024): the tracker then sees one active request but no increase in bytes, so it fires the timeout. It's possible, and I did see it in CI, because the CI timeout is shorter than 1 min. Maybe I should refresh the lastSeenTimestamp when we have a new active request. But I am still worried about the IO pressure created by umount actions or by unpacking a huge layer, which is triggered by the writeback thread (ext4 has default commit=5).

Let me rethink it and update it later. Thanks
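
For illustration, a compilable sketch of the race described above and the proposed lastSeenTimestamp refresh; the field and method names are invented, and the real logic lives in containerd's pull progress reporter:

  package reporter

  import (
      "sync"
      "time"
  )

  // reporter models the check that can misfire: the ticker fires right
  // after a request becomes active but before any bytes arrive, so the
  // check sees an active request with zero new bytes and times out.
  type reporter struct {
      mu                sync.Mutex
      activeRequests    int
      seenBytes         int64
      lastSeenBytes     int64
      lastSeenTimestamp time.Time
  }

  // requestStarted applies the proposed fix: refreshing lastSeenTimestamp
  // when a request becomes active gives it a full interval before it can
  // be timed out.
  func (r *reporter) requestStarted() {
      r.mu.Lock()
      defer r.mu.Unlock()
      r.activeRequests++
      r.lastSeenTimestamp = time.Now()
  }

  // timedOut is what the ticker callback would evaluate.
  func (r *reporter) timedOut(timeout time.Duration) bool {
      r.mu.Lock()
      defer r.mu.Unlock()
      if r.activeRequests == 0 {
          return false
      }
      if r.seenBytes > r.lastSeenBytes {
          r.lastSeenBytes = r.seenBytes
          r.lastSeenTimestamp = time.Now()
          return false
      }
      return time.Since(r.lastSeenTimestamp) >= timeout
  }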

fuweid added a commit to fuweid/containerd that referenced this pull request Nov 15, 2023
@thaJeztah
Member

the 1-min value comes from the kubelet setting for dockershim.

Thanks for that added info. Yes, it would be good to document that. "magic" values are always tricky if they're not "trivial"; it's easy to forget where they came from (and can easily lead to wrong conclusions)

Sorry for the late reply.

Oh man, no worries; no need to apologize! Thanks!

fuweid added a commit to fuweid/containerd that referenced this pull request Nov 17, 2023
fuweid added a commit to fuweid/containerd that referenced this pull request Nov 18, 2023
@fuweid removed the cherry-pick/1.7.x label (change to be cherry-picked to the release/1.7 branch) on Nov 21, 2023
@fuweid
Member

fuweid commented Nov 21, 2023

closed by #9369

@fuweid closed this on Nov 21, 2023
ruiwen-zhao pushed a commit to ruiwen-zhao/containerd that referenced this pull request Nov 21, 2023
smallfoot47 pushed a commit to smallfoot47/containerd that referenced this pull request Dec 11, 2023
Signed-off-by: Wei Fu <[email protected]>
Signed-off-by: Vivek Radhakrishnan <[email protected]>
juliusl pushed a commit to juliusl/containerd that referenced this pull request Jan 26, 2024