-
Notifications
You must be signed in to change notification settings - Fork 3.7k
feature: support image pull progress timeout #6150
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Skipping CI for Draft Pull Request. |
|
Build succeeded.
|
|
/test all |
|
Build succeeded.
|
07ba1ca to
63860e9
Compare
|
Build succeeded.
|
pkg/cri/config/config.go
Outdated
| NetNSMountsUnderStateDir bool `toml:"netns_mounts_under_state_dir" json:"netnsMountsUnderStateDir"` | ||
| // ImagePullProgressDeadline is the option that If no pulling progress | ||
| // is made before this deadline, the image pulling will be cancelled. | ||
| ImagePullProgressDeadline string `toml:"image_pull_progress_deadline" json:"imagePullProgressDeadline"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to note that the format expected is the go duration format like the comment for StreamIdleTimeout calls out above. https://golang.org/pkg/time/#ParseDuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nod let's use the description format from stream idle above if we stick to this being a timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated with image_pull_progress_timeout and used description comment like stream_idle_timeout.
pkg/cri/server/image_pull.go
Outdated
| WithField("lastSeenBytesRead", lastSeenBytesRead). | ||
| WithField("lastSeenTimestamp", lastSeenTimestamp). | ||
| WithField("reportInterval", reportInterval). | ||
| Tracef("the progress of pulling %v", reporter.ref) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense for the ref to be on a WithField also instead (and the text to just change to "progress for image pull")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
pkg/cri/config/config_unix.go
Outdated
| ImageDecryption: ImageDecryption{ | ||
| KeyModel: KeyModelNode, | ||
| }, | ||
| ImagePullProgressDeadline: time.Minute.String(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ruiwen-zhao See if you have concern on the 1 minute default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am concerning if this may break current customer...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i am ok with this default. If there is no data transferred within 1 min, then most likely the image pull is gonna fail.
That said, I am with Qiutong and would like to stay on the safer side. Can we make the default to 2m, which is the default RuntimeRequestTimeout used by kubelet (code), so that this change will not change existing kubernetes behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ruiwen-zhao If my memory is correct, the code you mentioned is used to connect to server, not for pulling request. And the kubelet doesn't add timeout for pulling request. That is why kubelet use image_pull_progress_deadline(default=1m) for dockershim.
pkg/cri/config/config.go
Outdated
| } | ||
|
|
||
| // Validation for image_pull_progress_deadline | ||
| if c.ImagePullProgressDeadline != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe the config name should be something like ImagePullProgressTimeout to align with the existing StreamIdleTimeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm is it a connection timeout or progress rate minimum
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I updated with ImagePullProgressTimeout. PTAL.
pkg/cri/config/config_unix.go
Outdated
| ImageDecryption: ImageDecryption{ | ||
| KeyModel: KeyModelNode, | ||
| }, | ||
| ImagePullProgressDeadline: time.Minute.String(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i am ok with this default. If there is no data transferred within 1 min, then most likely the image pull is gonna fail.
That said, I am with Qiutong and would like to stay on the safer side. Can we make the default to 2m, which is the default RuntimeRequestTimeout used by kubelet (code), so that this change will not change existing kubernetes behavior?
| // active pull requests once. | ||
| func (r *countingReadCloser) Close() error { | ||
| err := r.rc.Close() | ||
| r.once.Do(r.reqReporter.decRequest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for my own education: when would the responseCloser be closed multiple times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, it will be closed only once. Just in case :) because the Close is allowed to call multiple times.
mikebrow
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting feature!
Maybe consider making this more about meeting an acceptable progress rate minimum over time than a any progress at all timeout value? Say n MB / d seconds ... first number KB/MB/GB second number ms/sec/min give them two values to set in one string.
pkg/cri/config/config.go
Outdated
| NetNSMountsUnderStateDir bool `toml:"netns_mounts_under_state_dir" json:"netnsMountsUnderStateDir"` | ||
| // ImagePullProgressDeadline is the option that If no pulling progress | ||
| // is made before this deadline, the image pulling will be cancelled. | ||
| ImagePullProgressDeadline string `toml:"image_pull_progress_deadline" json:"imagePullProgressDeadline"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nod let's use the description format from stream idle above if we stick to this being a timeout
pkg/cri/config/config.go
Outdated
| } | ||
|
|
||
| // Validation for image_pull_progress_deadline | ||
| if c.ImagePullProgressDeadline != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm is it a connection timeout or progress rate minimum
@mikebrow I was thinking to use content service's status to get the rate. But it consumes more CPU resources to get that status (open 5 file to get one ingest's status). And in this patch, I use |
|
Build succeeded.
|
It seems there are two flaky test cases. |
``` // from containerd#6150 (comment) container_stats_test.go:364: Error Trace: container_stats_test.go:364 Error: Received unexpected error: unexpected stats length github.com/containerd/containerd/integration.TestContainerListStatsWithIdSandboxIdFilter.func4 /home/runner/work/containerd/containerd/integration/container_stats_test.go:371 github.com/containerd/containerd/integration.Eventually /home/runner/work/containerd/containerd/integration/main_test.go:324 github.com/containerd/containerd/integration.TestContainerListStatsWithIdSandboxIdFilter /home/runner/work/containerd/containerd/integration/container_stats_test.go:364 testing.tRunner /opt/hostedtoolcache/go/1.17.3/x64/src/testing/testing.go:1259 runtime.goexit /opt/hostedtoolcache/go/1.17.3/x64/src/runtime/asm_amd64.s:1581 Test: TestContainerListStatsWithIdSandboxIdFilter ``` Add stats result for the purpose of debug. Signed-off-by: Wei Fu <[email protected]>
|
Build succeeded.
|
|
ping @kzys |
| if err != nil { | ||
| return nil, fmt.Errorf("failed to get instance of service %q: %w", s, err) | ||
| } | ||
| if i == nil { | ||
| return nil, fmt.Errorf("instance of service %q not found", s) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a case where i == nil but err == nil? I generally don't handle that case, assuming the function would return either of them, not both.
This comment was marked as duplicate.
This comment was marked as duplicate.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the Plugin init function. For now, I agree with you there is no such case.
However, the Registration type doesn't require that InitFn must return non-nil instance.
If the plugin is used to check something without creating new object, it can return nil.
And it is copied from pkg/cri/cri.go. If we require InitFn return non-nil instance, I can file the following pr to update it.
| errCh := make(chan error) | ||
| go func() { | ||
| defer close(errCh) | ||
|
|
||
| _, err := criService.PullImage(ctx, &runtimeapi.PullImageRequest{ | ||
| Image: &runtimeapi.ImageSpec{ | ||
| Image: pullProgressTestImageName, | ||
| }, | ||
| }) | ||
| errCh <- err | ||
| }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using errgroup instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to use select statement with timeout and release the content lock.
If uses errgroup, it looks complicated.
| select { | ||
| case <-time.After(defaultImagePullProgressTimeout * 5): | ||
| // release the lock | ||
| cleanupWriters() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be covered by defer above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. But I need to release the content lock manually and then the PullImage goroutine can continue to pull and return nil-error.
| CDISpecDirs: []string{"/etc/cdi", "/var/run/cdi"}, | ||
| EnableCDI: false, | ||
| CDISpecDirs: []string{"/etc/cdi", "/var/run/cdi"}, | ||
| ImagePullProgressTimeout: time.Minute.String(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to change the default behavior here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is aligned with kubelet default config. 😂
I think we can set this feature as default behavior, since it is improvement~
Kubelet sends the PullImage request without timeout, because the image size is unknown and timeout is hard to defined. The pulling request might run into 0B/s speed, if containerd can't receive any packet in that connection. For this case, the containerd should cancel the PullImage request. Although containerd provides ingester manager to track the progress of pulling request, for example `ctr image pull` shows the console progress bar, it needs more CPU resources to open/read the ingested files to get status. In order to support progress timeout feature with lower overhead, this patch uses http.RoundTripper wrapper to track active progress. That wrapper will increase active-request number and return the countingReadCloser wrapper for http.Response.Body. Each bytes-read can be count and the active-request number will be descreased when the countingReadCloser wrapper has been closed. For the progress tracker, it can check the active-request number and bytes-read at intervals. If there is no any progress, the progress tracker should cancel the request. NOTE: For each blob data, the containerd will make sure that the content writer is opened before sending http request to the registry. Therefore, the progress reporter can rely on the active-request number. fixed: containerd#4984 Signed-off-by: Wei Fu <[email protected]>
|
Build succeeded.
|
dmcgowan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I still think we can refactor the progress stuff out of CRI but we can think about that interface a little more
Agree. I will open issue to track this. |
fixed: #4984
Signed-off-by: Wei Fu [email protected]