Skip to content

shim: move event context timeout to publsher#4412

Merged
fuweid merged 1 commit intocontainerd:masterfrom
cpuguy83:shim2_event_cancelled
Jul 22, 2020
Merged

shim: move event context timeout to publsher#4412
fuweid merged 1 commit intocontainerd:masterfrom
cpuguy83:shim2_event_cancelled

Conversation

@cpuguy83
Copy link
Copy Markdown
Member

@cpuguy83 cpuguy83 commented Jul 20, 2020

Before this change, if an event fails to send on the first attempt,
subsequent attempts will fail with context.Cancelled because the the
caller of publish passes a cancellable timeout, which the publisher uses
to send the event.

The publisher returns immediately if the send fails, but adds the event
to an async queue to try again.
Meanwhile the caller will return cancelling the context.

Additionally, subsequent attempts may fail to send because the timeout
was expected to be for a single request but the queue sleeps for
attempt*time.Second.

In the shim service, the timeout was set to 5s, which means the send
will fail with context.DeadlineExceeded before it reaches maxRequeue
(which is currently 5).

This change moves the timeout to the publisher so each send attempt gets
its own timeout.

Fixes #4400

@cpuguy83
Copy link
Copy Markdown
Member Author

I have a test for this, but since the old behavior is racey it is difficult to make the test fail with the old code without arbitrary sleeps.

@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Jul 20, 2020

Build succeeded.

Comment thread runtime/v2/shim/publisher.go Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ctx isn't used in between here and next Forward, this could avoid the defer by just calling cancel after.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Before this change, if an event fails to send on the first attempt,
subsequent attempts will fail with context.Cancelled because the the
caller of publish passes a cancellable timeout, which the publisher uses
to send the event.

The publisher returns immediately if the send fails, but adds the event
to an async queue to try again.
Meanwhile the caller will return cancelling the context.

Additionally, subsequent attempts may fail to send because the timeout
was expected to be for a single request but the queue sleeps for
`attempt*time.Second`.

In the shim service, the timeout was set to 5s, which means the send
will fail with context.DeadlineExceeded before it reaches `maxRequeue`
(which is currently 5).

This change moves the timeout to the publisher so each send attempt gets
its own timeout.

Signed-off-by: Brian Goff <[email protected]>
@cpuguy83 cpuguy83 force-pushed the shim2_event_cancelled branch from ac78efc to d7b9cb0 Compare July 21, 2020 00:51
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented Jul 21, 2020

Build succeeded.

@crosbymichael
Copy link
Copy Markdown
Member

LGTM

Copy link
Copy Markdown
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Shim cleanup events fail due to context.Cancelled

4 participants