pkg/retry: retry 50{2,3,4} server errors #2378
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Luap99

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Note this is untested at this point; it is not quite clear to me how best to test this. I guess we need to set up a webserver that responds with 5xx all the time.
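One way to exercise that locally (a rough sketch, not the test that ended up in the PR; the test name, path, and assertions are made up for illustration) is an httptest server that unconditionally answers 502, so any client operation pointed at it hits the 5xx path:

```go
package retry_test

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// TestAlways502 is an illustrative sketch only: it starts a server that
// always responds with 502 Bad Gateway. Wiring the retry helper from
// pkg/retry around an operation that talks to this server is left out here.
func TestAlways502(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, "bad gateway", http.StatusBadGateway)
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/v2/")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusBadGateway {
		t.Fatalf("expected 502, got %d", resp.StatusCode)
	}
}
```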
pkg/retry/retry.go
Outdated
case *docker.UnexpectedHTTPStatusError:
	// Retry on 5xx http server errors, they appear to be quite common in the field.
	// https://github.com/containers/common/issues/2299
	if e.StatusCode/100 == 5 {
- 501 Not Implemented ?
- 505 HTTP Version Not Supported ?
- 507 Insufficient Storage (“If the request that received this status code was the result of a user action, the request MUST NOT be repeated until it is requested by a separate user action.”)?
This is not ultimately blocking: I’m generally skeptical about blind retries, and worried that they might hurt more than they help. But, also, I would have originally objected to retrying on 502 as insufficiently explicit about retries being valuable, so I wouldn’t trust my own judgement :)
I didn’t carefully evaluate each individual status code just now; let’s first discuss whether we want to deal with things at that level.
Fair points, I think we can start small and slowly add more codes if we see the need.
502 is the very common one with quay.io; it happens all the time. I guess that is just a load balancer returning it because the backend server became unavailable. I assume that retries would bring you to a different backend and thus commonly work.
I have seen other codes from quay.io but I don't have the logs anymore to know the exact details. From memory I think they were 503 and 504.
I am fine to start with 502 and then we can reconsider others as we encounter them.
Personally, I recommend retrying on the 5xx errors that tend to be reported by reverse proxies/load balancers when the underlying service is unavailable:
- 502 Bad Gateway
- 503 Service Unavailable
- 504 Gateway Timeout
Retries are more likely to be successful if the backend registry is capable of self-healing.
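A self-contained sketch of what limiting the check to those three codes could look like (the helper name and the surrounding program are hypothetical, not the actual pkg/retry code):

```go
package main

import (
	"fmt"
	"net/http"
)

// retryableStatus reports whether a status code is one of the gateway-style
// 5xx errors that are usually transient: 502, 503, and 504.
func retryableStatus(code int) bool {
	switch code {
	case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	}
	return false
}

func main() {
	for _, code := range []int{500, 501, 502, 503, 504, 505} {
		fmt.Printf("%d retryable: %v\n", code, retryableStatus(code))
	}
}
```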
I didn’t actually go read the specs, but intuitively that sounds reasonable to me.
… and if we are retrying on the catch-all “something behind a proxy failed” 502, I guess retrying on bare “something failed without a proxy” 500 would be consistent?
It is the case that some hard failures (“unsupported manifest format”) show up as 500… but would that be 500 or 502 when fronted by a reverse proxy / load balancer?
mtrmac left a comment
LGTM. The linter is unhappy, that’s probably unrelated.
Includes the exported UnexpectedHTTPStatusError type.

Signed-off-by: Paul Holzinger <[email protected]>
For quay.io it is quite common that it fails with 50{2,3,4} errors and
trying again is often enough to fix them. As such it would be good if
we retry them by default.
Fixes containers#2299
Signed-off-by: Paul Holzinger <[email protected]>
I did a local e2e test with a reverse proxy in front of the default docker registry container; it seems to work fine:
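The setup itself isn't reproduced above; purely as an illustration (the backend address, listen port, and failure count below are made-up values, not the configuration used in that test), a small Go reverse proxy that fails the first few requests with 502 and then forwards to a local registry could look like this:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Assumed backend: a registry container listening on localhost:5000.
	backend, err := url.Parse("http://localhost:5000")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)

	// Return 502 for the first few requests, then pass traffic through, so a
	// client with retries enabled should eventually succeed.
	// Not concurrency-safe; good enough for a quick manual test.
	remainingFailures := 3
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if remainingFailures > 0 {
			remainingFailures--
			http.Error(w, "bad gateway", http.StatusBadGateway)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```

Pointing a pull at the proxy address should then exercise the retry path a few times before succeeding.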
@containers/podman-maintainers PTAL
/lgtm
The point is to get the new skopeo 1.19. It includes this fix in containers/common: containers/common#2378
For quay.io it is quite common that it fails with 5xx errors and trying
again is often enough to fix them. As such it would be good if we retry
them by default.
Fixes #2299