fix: ImagePull should close http connection if there is no available data to read.#9369
Conversation
|
Skipping CI for Draft Pull Request. |
|
ping @samuelkarp @ruiwen-zhao @thaJeztah please help to review 5-minutes is from https://github.com/kubernetes/kubernetes/blob/1635c380b26a1d8cc25d36e9feace9797f4bae3c/cluster/gce/util.sh#L882 and the information provided by andyzhangx cc @dmcgowan @ktock since I change the remote/dockers code path. Please take a look. thanks |
|
I'll try to have a look at this one, but not deeply familiar, so don't let that keep others from beating me to it 😂
It's fine to make that |
| EnableCDI: false, | ||
| CDISpecDirs: []string{"/etc/cdi", "/var/run/cdi"}, | ||
| ImagePullProgressTimeout: time.Minute.String(), | ||
| ImagePullProgressTimeout: (5 * time.Minute).String(), |
There was a problem hiding this comment.
Just a quick blurb, per my comment on the other PR, it may be good to document where this timeout came from (why we chose this value); which could be a const somewhere that has the documentation on it.
There was a problem hiding this comment.
Sorry, I forgot the original comment. Updated. Please take a look.
| return n, nil | ||
| } | ||
| } | ||
| // close the connection if there is no more available data |
There was a problem hiding this comment.
I think we should have a note about the reason why we need to immediately release the resource (i.e. caller can rely on the status of the reader for monitoring purpose, etc.).
I'm also a bit curious about whether it's possible to run src.Close() before content.Commit(dst) to make sure releasing resources before the long task.
defer src.Close() content.Commit(dst)
There was a problem hiding this comment.
I think we should have a note about the reason why we need to immediately release the resource (i.e. caller can rely on the status of the reader for monitoring purpose, etc.).
Thanks! Updated.
I'm also a bit curious about whether it's possible to run src.Close() before content.Commit(dst) to make sure releasing resources before the long task.
I was thinking about this. But the httpReadSeeker can reconnect if there is unexpected EOF. And the pusher has ErrReset case and the content.Copy helper supports the retry. So, we can't just close the src here.
containerd/remotes/docker/pusher_test.go
Line 164 in 7deb68f
Line 183 in 7deb68f
eea52b3 to
2f33a9a
Compare
| // | ||
| // NOTE: | ||
| // | ||
| // This function is ported from kubelet/dockershim's --image-pull-progress-deadline. |
There was a problem hiding this comment.
confused about this line of comment. which function is it referring?
There was a problem hiding this comment.
Updated. Please take a look.
| // The CRI's imagePullProgressTimeout relies on responseBody.Close to | ||
| // update the process monitor's status. If the err is io.EOF, close | ||
| // the connection since there is no more available data. | ||
| if err == io.EOF { |
There was a problem hiding this comment.
feels like it's better to be a else if here?
2f33a9a to
584dc70
Compare
|
CI fails and it needs #9392 |
The new active request is filed and there is no bytes read yet when the progress reporter just wakes up. If the timeout / 2 is less than the minPullProgressReportInternal, it's easy to file false alert. We should remove the minPullProgressReportInternal limit. Fixes: containerd#8024 Signed-off-by: Wei Fu <[email protected]>
Signed-off-by: Wei Fu <[email protected]>
Close connection if no more data. It's to fix false alert filed by image
pull progress.
```
dst = OpenWriter (--> Content Store)
src = Fetch
Open (--> Registry)
Mark it as active request
Copy(dst, src) (--> Keep updating total received bytes)
^
| (Active Request > 0, but total received bytes won't be updated)
v
defer src.Close()
content.Commit(dst)
```
Before migrating to transfer service, CRI plugin doesn't limit global
concurrent downloads for ImagePulls. Each ImagePull requests have 3 concurrent
goroutines to download blob and 1 goroutine to unpack blob. Like ext4
filesystem [1][1], the fsync from content.Commit may sync unrelated dirty pages
into disk. The host is running under IO pressure, and then the content.Commit
will take long time and block other goroutines. If httpreadseeker
doesn't close the connection after io.EOF, this connection will be
considered as active. The pull progress reporter reports there is no
bytes transfered and cancels the ImagePull.
The original 1-minute timeout[2][2] is from kubelet settting. Since CRI-plugin
can't limit the total concurrent downloads, this patch is to update 1-minute
to 5-minutes to prevent from unexpected cancel.
[1]: https://lwn.net/Articles/842385/
[2]: https://github.com/kubernetes/kubernetes/blob/release-1.23/pkg/kubelet/config/flags.go#L45-L48
Signed-off-by: Wei Fu <[email protected]>
584dc70 to
80dd779
Compare
Close connection if no more data. It's to fix false alert filed by image
pull progress.
Before migrating to transfer service, CRI plugin doesn't limit global
concurrent downloads for ImagePulls. Each ImagePull requests have 3 concurrent
goroutines to download blob and 1 goroutine to unpack blob. Like ext4
filesystem 1, the fsync from content.Commit may sync unrelated dirty pages
into disk. The host is running under IO pressure, and then the content.Commit
will take long time and block other goroutines. If httpreadseeker
doesn't close the connection after io.EOF, this connection will be
considered as active. The pull progress reporter reports there is no
bytes transfered and cancels the ImagePull.
The original 1-minute timeout2 is from kubelet settting. Since CRI-plugin
can't limit the total concurrent downloads, this patch is to update 1-minute
to 5-minutes to prevent from unexpected cancel.
Fixes: #8024
Alternative: #9347
Signed-off-by: Wei Fu [email protected]