Skip to content

remotes/docker: Only return "already exists" on push when the upload was sucessful#5276

Merged
fuweid merged 1 commit intocontainerd:masterfrom
aaronlehmann:tracker-false-positives
Apr 7, 2021
Merged

remotes/docker: Only return "already exists" on push when the upload was sucessful#5276
fuweid merged 1 commit intocontainerd:masterfrom
aaronlehmann:tracker-false-positives

Conversation

@aaronlehmann
Copy link
Copy Markdown

The (dockerPusher).Push method uses a StatusTracker to check if an
upload already happened, before repeating the upload. However, there is
no provision for failure handling. If a PUT request returns an error,
the StatusTracker will still see the upload as if it happened
successfully. Add a status boolean so that only successful uploads
short-circuit Push.

Signed-off-by: Aaron Lehmann [email protected]

@k8s-ci-robot
Copy link
Copy Markdown

Hi @aaronlehmann. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@aaronlehmann
Copy link
Copy Markdown
Author

Somewhat related to this PR: Push uses doWithRetries, but because the body comes from a pipe that isn't seekable, an attempt to retry the upload will fail. I'm not sure what purpose doWithRetries serves in this case, since the architecture of Push doesn't seem like it really supports retries. It may make sense to redesign this (or stop attempting retries in Push). I ran into the issue in this PR because I originally wanted to adjust the retry logic for Push, but found I had to go one level up to the library user to get usable retries.

aaronlehmann pushed a commit to aaronlehmann/buildkit that referenced this pull request Mar 27, 2021
Some registries can be flaky and return intermittent 5xx errors. This
change allows those errors to be retried, similarly to network-level
errors.

Note that this needs the upstream containerd fix
containerd/containerd#5276 to work reliably.

This was tested with a registry that was modified to return 504 on every
other manifest PUT. Without the change, exports to the registry fail
every other attempt.  With the change and the related containerd change,
exports to the registry always succeed.

Signed-off-by: Aaron Lehmann <[email protected]>
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Mar 27, 2021

Build succeeded.

@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Mar 27, 2021

Build succeeded.

@thaJeztah
Copy link
Copy Markdown
Member

Hi 👋 @aaronlehmann long time no see 🤗

@dmcgowan
Copy link
Copy Markdown
Member

@aaronlehmann related to doWithRetries, I agree that part could be improved, especially the returned errors. It is designed right now to just fail on a subsequent request after a body has already been requested from a previous request, I'm not sure there is a case today where that would end up doing anything though. There have probably been some improvements in Go that allow us to clean this up further too.

What has been your experience with push here though? This code flow hasn't been tested extensively outside of buildkit use cases

@aaronlehmann
Copy link
Copy Markdown
Author

What has been your experience with push here though? This code flow hasn't been tested extensively outside of buildkit use cases

This came up in a buildkit context.

@aaronlehmann
Copy link
Copy Markdown
Author

@dmcgowan: Would you happen to be the right person to review this? It would be great to get it upstream!

@thaJeztah
Copy link
Copy Markdown
Member

@aaronlehmann I saw your PR in BuildKit linking to this one (moby/buildkit#2043) was to handle 5xx errors; is this fix related to / trying to solve the same issue as #5288 ?

/cc @kzys

@k8s-ci-robot k8s-ci-robot requested a review from kzys April 2, 2021 08:32
@aaronlehmann
Copy link
Copy Markdown
Author

No, I wasn't aware of that issue. The problem I was trying to solve was flaky infrastructure causing sporadic 5xx errors which caused pushes to fail.

@dmcgowan
Copy link
Copy Markdown
Member

dmcgowan commented Apr 2, 2021

@dmcgowan: Would you happen to be the right person to review this? It would be great to get it upstream!

Yes, but I am still trying to understand this state a little better. Completed seems like it might share some state with committed, these states don't transfer well from the registry API to content API

@aaronlehmann
Copy link
Copy Markdown
Author

The issue here was that the check near the beginning of Push was only checking the number of bytes sent. If we had sent N bytes of an N byte blob to the registry, we assume the registry has that blob, but that's a bad assumption to make. What if we sent those bytes but the registry returned an error and didn't commit the blob? That's the condition I'm trying to handle.

I'm fine with renaming it to Committed if you think that's a better description.

@dmcgowan
Copy link
Copy Markdown
Member

dmcgowan commented Apr 5, 2021

Committed would be a better field name.

Why I think it still doesn't transfer well is because the content API works a little differently than this. The Status type is intended to represent an active ingest. The "more correct" use of the content API would mean the upload should not check the size at all, but rather continue the process as long as there is a status object returned. After the upload is successful, the status should be removed and the "already exists" would happen via calling Info on the content.Manager interface.

My only reservation is related to making any changes to content.Status. For this particular change, I do think the change should occur inside the Commit method (there is one related TODO there). I also think it may make sense to just keep a separate record of the committed Info objects to the registry. This would be a little more work than just updating the status, but long term I think it is the right way to integrate with the interface. So the changes would mean the tracker would support removing status and keeping an Info cache (just a cache since this could also be checked via Head to the registry).

@aaronlehmann aaronlehmann force-pushed the tracker-false-positives branch from aea6a5f to 4ba9a24 Compare April 5, 2021 17:17
@aaronlehmann
Copy link
Copy Markdown
Author

Thanks, that makes sense. I've modified this to rename the field to Committed, move it from content.Status to docker.Status, and make the update in the Commit method.

I'm not completely clear on how the Info cache would fit into this. It looks to me like the pusher code is making HTTP requests directly rather than using the content.Manager abstraction. Are you suggesting that dockerPusher should be refactored to use content.Manager?

@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Apr 5, 2021

Build succeeded.

@dmcgowan
Copy link
Copy Markdown
Member

dmcgowan commented Apr 5, 2021

@aaronlehmann actually this works great, thanks!

LGTM

merge might be on hold still until CI issues resolved

@aaronlehmann aaronlehmann force-pushed the tracker-false-positives branch from 4ba9a24 to cabdd7f Compare April 5, 2021 19:29
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Apr 5, 2021

Build succeeded.

@aaronlehmann aaronlehmann force-pushed the tracker-false-positives branch from cabdd7f to 6015315 Compare April 6, 2021 16:17
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Apr 6, 2021

Build succeeded.

@aaronlehmann
Copy link
Copy Markdown
Author

Is there a known issue with CI? I've tried a few times since yesterday and continue to see these failures.

Copy link
Copy Markdown
Member

@mikebrow mikebrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mikebrow
Copy link
Copy Markdown
Member

mikebrow commented Apr 7, 2021

Is there a known issue with CI? I've tried a few times since yesterday and continue to see these failures.

yeah rebase and you should be good now

@aaronlehmann aaronlehmann force-pushed the tracker-false-positives branch from 6015315 to 58bfb14 Compare April 7, 2021 01:46
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Apr 7, 2021

Build succeeded.

…was successful

The `(dockerPusher).Push` method uses a `StatusTracker` to check if an
upload already happened, before repeating the upload. However, there is
no provision for failure handling. If a PUT request returns an error,
the `StatusTracker` will still see the upload as if it happened
successfully. Add a status boolean so that only successful uploads
short-circuit `Push`.

Signed-off-by: Aaron Lehmann <[email protected]>
@aaronlehmann aaronlehmann force-pushed the tracker-false-positives branch from 58bfb14 to 4c1fa57 Compare April 7, 2021 02:45
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Apr 7, 2021

Build succeeded.

Copy link
Copy Markdown
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fuweid fuweid merged commit 85041ff into containerd:master Apr 7, 2021
tonistiigi pushed a commit to tonistiigi/buildkit that referenced this pull request Apr 22, 2021
Some registries can be flaky and return intermittent 5xx errors. This
change allows those errors to be retried, similarly to network-level
errors.

Note that this needs the upstream containerd fix
containerd/containerd#5276 to work reliably.

This was tested with a registry that was modified to return 504 on every
other manifest PUT. Without the change, exports to the registry fail
every other attempt.  With the change and the related containerd change,
exports to the registry always succeed.

Signed-off-by: Aaron Lehmann <[email protected]>
(cherry picked from commit d3b96f4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants