Skip to content

Conversation

@Luap99
Copy link
Member

@Luap99 Luap99 commented Mar 21, 2025

For quay.io it is quite common that if fails with 5xx errors and trying
again is often enough to fix them. As such it would be good if we retry
them by default.

Fixes #2299

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 21, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Luap99
Copy link
Member Author

Luap99 commented Mar 21, 2025

Note this is untested at this point, it is not quite clear to me how to test this best. I guess we need to setup a webserver that responds 5xx all the time.

case *docker.UnexpectedHTTPStatusError:
// Retry on 5xx http server errors, they appear to be quite common in the field.
// https://github.com/containers/common/issues/2299
if e.StatusCode/100 == 5 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • 501 Not Implemented ?
  • 505 HTTP Version Not Supported ?
  • 507 Insufficient Storage (“If the request that received this status code was the result of a user action, the request MUST NOT be repeated until it is requested by a separate user action.”)?

This is not ultimately blocking: I’m generally skeptical about blind retries, and worried that they might hurt more than they help. But, also, I would have originally objected to retrying on 502 as insufficiently explicit about retries being valuable, so I wouldn’t trust my own judgement :)

I didn’t now carefully evaluate each individual status code, let’s first discuss whether we want to deal with things at that level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair points, I think we can start small and slowly add more codes if we see the need.

502 is the very common one with quay.io that happens all the time. I guess that is just a load balancer retuning it as the backend server went unavailable. I assume that retries would bring you do a different backend and thus commonly work.
I have seen other codes from quay.io but I don't have the logs anymore to know the exact details. From memory I think they were 503 and 504.

I am fine to start with 502 and then we can reconsider others as we encounter them.

Copy link

@adambkaplan adambkaplan Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I recommend checking 5xx errors that tend to be reported by reverse proxies/load balancers if the underlying service is unavailable:

  • 502 Bad Gateway
  • 503 Service Unavailable
  • 504 Gateway Timeout

Retries are more likely to be successful if the backend registry is capable of self-healing.

Copy link
Contributor

@mtrmac mtrmac Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t actually go read the specs, but intuitively that sounds reasonable to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

… and if we are retrying on the catch-all “something behind a proxy failed” 502, I guess retrying on bare “something failed without a proxy” 500 would be consistent?

It is the case that some hard failures (“unsupported manifest format”) show up as 500… but would that be 500 or 502 when fronted by a reverse proxy / load balancer?

@Luap99 Luap99 changed the title pkg/retry: retry 5xx server errors pkg/retry: retry 50{2,3,4} server errors Mar 24, 2025
Copy link
Contributor

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The linter is unhappy, that’s probably unrelated.

Luap99 added 2 commits March 24, 2025 18:05
Includes the exported UnexpectedHTTPStatusError type.

Signed-off-by: Paul Holzinger <[email protected]>
For quay.io it is quite common that if fails with 50{2,3,4} errors and
trying again is often enough to fix them. As such it would be good if
we retry them by default.

Fixes containers#2299

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99 Luap99 marked this pull request as ready for review March 24, 2025 17:06
@Luap99
Copy link
Member Author

Luap99 commented Mar 24, 2025

I did a local e2e test with a reverse proxy in front of the default docker registry container, seems to work fine:

WARN[0000] Failed, retrying in 1s ... (1/3). Error: trying to reuse blob sha256:b66a884aaf08f1c410c61682a7072d68a0d837ba8dc16ff13b9509bdceb32fd2 at destination: pinging container registry localhost:6000: received unexpected HTTP status: 502 Bad Gateway 
Getting image source signatures
WARN[0001] Failed, retrying in 1s ... (2/3). Error: trying to reuse blob sha256:b66a884aaf08f1c410c61682a7072d68a0d837ba8dc16ff13b9509bdceb32fd2 at destination: pinging container registry localhost:6000: received unexpected HTTP status: 502 Bad Gateway 
e8244d37dbbc31ea23908761b9e14e0eb5e8606b58496c82c3c156d3a23a7eed
Getting image source signatures
Copying blob b66a884aaf08 done   | 
Copying blob 5f70bf18a086 done   | 
Copying config b82e560ed5 done   | 
Writing manifest to image destination

@Luap99
Copy link
Member Author

Luap99 commented Mar 27, 2025

@containers/podman-maintainers PTAL

@mheon
Copy link
Member

mheon commented Mar 27, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 27, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 8b6a29c into containers:main Mar 27, 2025
15 checks passed
@Luap99 Luap99 deleted the http-docker-err branch March 27, 2025 19:13
ralphbean added a commit to konflux-ci/build-tasks-dockerfiles that referenced this pull request Jun 5, 2025
The point is to get the new skopeo 1.19.

It includes this fix in containers/common: containers/common#2378
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feature request: Retry if Registry Returns 5xx HTTP Error

4 participants