cri: handle sandbox/container exit event in parallel by fuweid · Pull Request #4682 · containerd/containerd

fuweid · 2020-10-31T11:40:39Z

The current design puts all task exit events into a single queue. A single consumer processes the events serially, one at a time, and must handle each task within 10 seconds. On timeout, the consumer puts the event into the backoff queue and retries it later.
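
A minimal sketch of the serial design described above, not the actual containerd code. The names `exitEvent`, `consumeSerially`, and `demo` are hypothetical, and the demo uses a 20 ms timeout in place of the 10 second budget so it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// exitEvent is a hypothetical stand-in for a task exit event.
type exitEvent struct{ id string }

// handleTimeout mirrors the 10-second per-task budget described above.
const handleTimeout = 10 * time.Second

// consumeSerially drains the queue with a single consumer, one event at a
// time. Events whose handler fails or times out are collected in a backoff
// list so the caller can retry them later.
func consumeSerially(queue <-chan exitEvent, timeout time.Duration,
	handle func(context.Context, exitEvent) error) []exitEvent {
	var backoff []exitEvent
	for ev := range queue {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := handle(ctx, ev)
		cancel()
		if err != nil {
			backoff = append(backoff, ev)
		}
	}
	return backoff
}

// demo queues three events; the handler for "b" is slower than the timeout,
// so "b" ends up in the backoff list.
func demo() []exitEvent {
	queue := make(chan exitEvent, 3)
	for _, id := range []string{"a", "b", "c"} {
		queue <- exitEvent{id: id}
	}
	close(queue)

	handle := func(ctx context.Context, ev exitEvent) error {
		if ev.id != "b" {
			return nil
		}
		select {
		case <-time.After(time.Second): // simulated slow task deletion
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return consumeSerially(queue, 20*time.Millisecond, handle)
}

func main() {
	fmt.Println(demo())
}
```

Because there is only one consumer, every slow event delays all the events queued behind it.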

If the node has to evict many pods for some reason, such as disk pressure, the backlog of task exit events grows. If each exit event takes 10 seconds, deleting the pods takes a long time.

For this case, I would like to propose handling task exit events in parallel. When a container/sandbox monitor receives a task exit event, the monitor should handle it first. If that fails, the monitor can put the event into the backoff queue so the event consumer can retry it later.
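
The proposal can be sketched roughly as follows, assuming one goroutine per exit event. `handleInParallel` and `demo` are hypothetical names, and the returned slice of failed IDs stands in for the backoff queue; the real patch may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// exitEvent is a hypothetical stand-in for a task exit event.
type exitEvent struct{ id string }

// handleInParallel handles every event in its own goroutine, so one slow
// event does not delay the others. IDs of events whose handler failed are
// returned (sorted) so they can be retried by the backoff consumer.
func handleInParallel(events []exitEvent, handle func(exitEvent) error) []string {
	// Buffered to len(events) so a failing goroutine never blocks on send.
	failed := make(chan string, len(events))
	var wg sync.WaitGroup
	for _, ev := range events {
		wg.Add(1)
		go func(ev exitEvent) {
			defer wg.Done()
			if err := handle(ev); err != nil {
				failed <- ev.id
			}
		}(ev)
	}
	wg.Wait()
	close(failed)

	var ids []string
	for id := range failed {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// demo handles three events; only "b" fails and needs a retry.
func demo() []string {
	events := []exitEvent{{id: "a"}, {id: "b"}, {id: "c"}}
	return handleInParallel(events, func(ev exitEvent) error {
		if ev.id == "b" {
			return errors.New("task delete failed")
		}
		return nil
	})
}

func main() {
	fmt.Println(demo())
}
```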

Signed-off-by: Wei Fu [email protected]

theopenlab-ci · 2020-10-31T11:50:37Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 8m 39s (non-voting)

theopenlab-ci · 2020-10-31T12:17:58Z

Build succeeded.

containerd-build-arm64 : SUCCESS in 6m 00s (non-voting)

fuweid · 2020-12-03T03:28:19Z

ping @AkihiroSuda and @mikebrow

cpuguy83 · 2020-12-08T23:41:00Z

I don't think I understand this change.
It might help to understand:

What is the current behavior
Why is the behavior problematic
Why does this solution help

fuweid · 2020-12-09T03:05:00Z

@cpuguy83

The current design puts all task exit events into a single queue. A single consumer processes the events serially, one at a time, and must handle each task within 10 seconds. On timeout, the consumer puts the event into the backoff queue and retries it later.

If the node has to evict many pods for some reason, such as disk pressure, the backlog of task exit events grows. If each exit event takes 10 seconds, deleting the pods takes a long time.

For this case, I would like to propose handling task exit events in parallel. When a container/sandbox monitor receives a task exit event, the monitor should handle it first. If that fails, the monitor can put the event into the backoff queue so the event consumer can retry it later.

Sorry for the unclear message. Does this make sense to you?

mikebrow

I like this a lot. Are there any paths left where handleEvent() could still receive a TaskExit? If not, we can probably remove that case.

mikebrow · 2020-12-18T18:08:29Z

The error output here always bothered me :-) If err != store.ErrNotExist, wrap the error with "can't find"? It is probably better to just say "unexpected error retrieving sandbox for TaskExit event".
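
A hedged sketch of the error handling suggested above. `errNotExist` is a hypothetical stand-in for `store.ErrNotExist`, `errIO` and `lookupForExit` are illustrative names, and treating a missing sandbox as "nothing to do" is an assumption. The standard library's `%w` wrapping is used here for brevity; the PR itself may use a different error package.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotExist is a hypothetical stand-in for store.ErrNotExist.
var errNotExist = errors.New("does not exist")

// errIO is a hypothetical unexpected failure, used only for the demo.
var errIO = errors.New("i/o failure")

// lookupForExit looks up the sandbox for a TaskExit event.
//   - found == true:  the sandbox exists and the event should be handled.
//   - found == false, err == nil: the sandbox is gone (assumed: nothing to do).
//   - err != nil: an unexpected lookup failure, wrapped with context.
func lookupForExit(get func(id string) (string, error), id string) (sb string, found bool, err error) {
	sb, err = get(id)
	if err == nil {
		return sb, true, nil
	}
	if errors.Is(err, errNotExist) {
		return "", false, nil
	}
	return "", false, fmt.Errorf("unexpected error retrieving sandbox for TaskExit event: %w", err)
}

func main() {
	missing := func(string) (string, error) { return "", errNotExist }
	broken := func(string) (string, error) { return "", errIO }

	_, found, err := lookupForExit(missing, "sb-1")
	fmt.Println(found, err)

	_, found, err = lookupForExit(broken, "sb-1")
	fmt.Println(found, err)
}
```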

fuweid · 2020-12-24T15:32:55Z

--- FAIL: TestLosetup (0.08s)
    --- FAIL: TestLosetup/RemoveLoopDevicesAssociatedWithImage (0.04s)
Error:         losetup_test.go:96: assertion failed: expected [/dev/loop1] (length 1) to have length 0

Hmm...

fuweid · 2020-12-25T04:34:17Z

The backoff is a map and needs a mutex to protect it. I will update it later.
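
As a sketch of the issue: once several goroutines can enqueue into the backoff, a plain Go map is no longer safe, because unsynchronized concurrent map writes are a data race. The `backoff` type and its methods below are hypothetical and only illustrate guarding the map with a `sync.Mutex`.

```go
package main

import (
	"fmt"
	"sync"
)

// backoff is a hypothetical retry store keyed by container or sandbox ID.
// The mutex guards every access to the map.
type backoff struct {
	mu     sync.Mutex
	queues map[string][]string // ID -> pending event names
}

func newBackoff() *backoff {
	return &backoff{queues: make(map[string][]string)}
}

// enqueue records an event for a later retry.
func (b *backoff) enqueue(id, event string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.queues[id] = append(b.queues[id], event)
}

// dequeue removes and returns all pending events for id.
func (b *backoff) dequeue(id string) []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	events := b.queues[id]
	delete(b.queues, id)
	return events
}

// demo enqueues 100 events for 10 IDs from concurrent goroutines and
// returns how many events were recorded in total.
func demo() int {
	b := newBackoff()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			b.enqueue(fmt.Sprintf("container-%d", i%10), "TaskExit")
		}(i)
	}
	wg.Wait()

	total := 0
	for i := 0; i < 10; i++ {
		total += len(b.dequeue(fmt.Sprintf("container-%d", i)))
	}
	return total
}

func main() {
	fmt.Println(demo())
}
```

Running a program like this with `go run -race` is one way to check that no unsynchronized map access remains.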

fuweid · 2021-01-09T04:47:32Z

-- FAIL: TestRwLoop (0.06s)
Error:     losetup_linux_test.go:93: write /dev/loop1: no space left on device

hmm....

fuweid · 2021-01-13T03:20:53Z

ping @mikebrow, the patch is ready for review. PTAL. Thanks!

mikebrow

LGTM
very nice!

fuweid · 2021-01-22T02:15:14Z

ping @AkihiroSuda and @cpuguy83 ~

The event monitor handles exit events one by one. If something goes wrong while deleting a task, it slows down the terminating Pods. To reduce the impact, the exit event watcher should handle each exit event separately. If handling fails, the watcher should put the event into the backoff queue and retry it.

Signed-off-by: Wei Fu <[email protected]>

fuweid · 2021-01-24T07:05:24Z

Updated the error message to follow the Go error message convention.

dmcgowan

LGTM

fuweid force-pushed the cri-handle-exit-event-separate branch from eb13925 to 939b159 Compare October 31, 2020 12:10

fuweid requested review from AkihiroSuda and mikebrow November 7, 2020 08:17

crosbymichael reviewed Nov 11, 2020

Comment thread pkg/cri/server/events.go Outdated

fuweid changed the title ~~cri: handle sandbox/container exit event separately~~ cri: handle sandbox/container exit event concurrently Dec 9, 2020

fuweid changed the title ~~cri: handle sandbox/container exit event concurrently~~ cri: handle sandbox/container exit event in parallel Dec 9, 2020

mikebrow reviewed Dec 18, 2020

fuweid force-pushed the cri-handle-exit-event-separate branch from 939b159 to 9760f88 Compare December 24, 2020 15:15

fuweid force-pushed the cri-handle-exit-event-separate branch 2 times, most recently from c6e6957 to 4de9c3d Compare January 9, 2021 04:27

fuweid force-pushed the cri-handle-exit-event-separate branch from 4de9c3d to e024dde Compare January 11, 2021 15:36

containerd deleted a comment from theopenlab-ci Bot Jan 11, 2021

mikebrow approved these changes Jan 13, 2021

fuweid added area/cri Container Runtime Interface (CRI) kind/performance labels Jan 21, 2021

mxpv approved these changes Jan 22, 2021

fuweid force-pushed the cri-handle-exit-event-separate branch from e024dde to e56de63 Compare January 24, 2021 07:01

containerd deleted a comment from theopenlab-ci Bot Jan 24, 2021

dmcgowan approved these changes Jan 24, 2021

dmcgowan merged commit f615c58 into containerd:master Jan 24, 2021

dmcgowan added the impact/changelog label Jan 24, 2021

fuweid deleted the cri-handle-exit-event-separate branch January 24, 2021 07:26
