Skip to content

Conversation

@yyb196
Copy link
Contributor

@yyb196 yyb196 commented Oct 19, 2023

I ran into a problem that pulling image report error:

 failed open: server message: invalid_token: authorization failed

And kubelet retry pulling the next time, it just successed.

After look into the detail I found that the image is a GPU image,
and we limit concurrent pulling number to 4, the token fetch from
registry will timeout in 10 minutes. when the very later layer get their
turn to download their blob, the token has become invalid.

Containerd has the policy to retry pulling when 401 returned from
the registry but if invalid authorization error returned, the cached
token didn't removed, and the next retry just don't work.

@k8s-ci-robot
Copy link

Hi @yyb196. Thanks for your PR.

I'm waiting for a containerd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yyb196 yyb196 force-pushed the fix-invalid-token branch from 21ca509 to 72ded46 Compare October 20, 2023 02:04
@ktock
Copy link
Member

ktock commented Oct 20, 2023

This seems to be related to #8735 .

@yyb196
Copy link
Contributor Author

yyb196 commented Oct 20, 2023

@ktock may be we encounter the same issue, but provide two different solutions. the solution of 8735 depend on that registry return the expire time of token. we better off to fix the retry policy in containerd, since it is more compatible to the existing registries.

@dmcgowan
Copy link
Member

dmcgowan commented Oct 20, 2023

@ktock @yyb196 I think we should probably take both solution. The other one prevents using tokens when we know are expired and this one does better handling of rejected tokens.

For this one I wonder if it is worth a debug log to explain why the token was rejected using the error message (limiting the length of the message for malformed error messages).

for _, c := range auth.ParseAuthHeader(last.Header) {
if c.Scheme == auth.BearerAuth {
if err := invalidAuthorization(c, responses); err != nil {
if retry, err := invalidAuthorization(c, responses); err != nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will one retry with no delay between be sufficient to get over the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't thin so, as the method name invalidAuthorization hinted, the registry server has found the auth invalid, retry with the same token is useless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmerkes Have I explained your question? this pr still need your approvement.

@yyb196 yyb196 force-pushed the fix-invalid-token branch 2 times, most recently from 54afc07 to 4adce5c Compare October 23, 2023 02:22
@yyb196
Copy link
Contributor Author

yyb196 commented Oct 23, 2023

For this one I wonder if it is worth a debug log to explain why the token was rejected using the error message (limiting the length of the message for malformed error messages).

@dmcgowan As your suggestion, I have added a debug log with a length limited error msg received from registry server.

Copy link
Member

@dmcgowan dmcgowan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a couple nits

if n == 1 || (n > 1 && !sameRequest(responses[n-2].Request, responses[n-1].Request)) {
return nil
limitedErr := errStr
if len(limitedErr) > 32 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64 would to fit a full error message

if len(limitedErr) > 32 {
limitedErr = limitedErr[:32] + "..."
}
log.G(ctx).WithField("error", limitedErr).Debug("need retry authorization")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"authorization error using bearer token, retrying"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified, PTALA

@dmcgowan dmcgowan added cherry-pick/1.6.x cherry-pick/1.7.x Change to be cherry picked to release/1.7 branch labels Oct 24, 2023
@yyb196 yyb196 force-pushed the fix-invalid-token branch from 4adce5c to 09e4051 Compare October 25, 2023 02:31
}

func invalidAuthorization(c auth.Challenge, responses []*http.Response) error {
func invalidAuthorization(ctx context.Context, c auth.Challenge, responses []*http.Response) (retry bool, _ error) {
Copy link
Member

@ktock ktock Oct 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: retry seems to be always false so maybe it's unneeded? (EDIT: true case doesn't seem to use retry)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is to hint what this return value means

@dmcgowan dmcgowan added this pull request to the merge queue Oct 30, 2023
Merged via the queue into containerd:main with commit 6021103 Oct 30, 2023
@zycupup
Copy link

zycupup commented Apr 8, 2024

Is this PR already cherry-picked into 1.6.x and 1.7.x?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-picked/1.6.x PR commits are cherry-picked into release/1.6 branch cherry-picked/1.7.x PR commits are cherry-picked into release/1.7 branch needs-ok-to-test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants