Skip to content

Fix API forward events for shims#3204

Merged
dmcgowan merged 3 commits intocontainerd:masterfrom
crosbymichael:fix-forward
Apr 11, 2019
Merged

Fix API forward events for shims#3204
dmcgowan merged 3 commits intocontainerd:masterfrom
crosbymichael:fix-forward

Conversation

@crosbymichael
Copy link
Copy Markdown
Member

@crosbymichael crosbymichael commented Apr 10, 2019

Fixes #3203

Signed-off-by: Michael Crosby [email protected]

Signed-off-by: Michael Crosby <[email protected]>
@crosbymichael crosbymichael added this to the 1.3 milestone Apr 10, 2019
@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #3204 into master will decrease coverage by 0.03%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3204      +/-   ##
==========================================
- Coverage   45.19%   45.16%   -0.04%     
==========================================
  Files         111      111              
  Lines       12009    12017       +8     
==========================================
  Hits         5428     5428              
- Misses       5746     5754       +8     
  Partials      835      835
Flag Coverage Δ
#linux 49.25% <0%> (-0.04%) ⬇️
#windows 40.43% <0%> (-0.04%) ⬇️
Impacted Files Coverage Δ
runtime/v2/shim/shim.go 0% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 475619c...4ba756e. Read the comment docs.

@codecov-io
Copy link
Copy Markdown

codecov-io commented Apr 10, 2019

Codecov Report

Merging #3204 into master will decrease coverage by 0.29%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #3204     +/-   ##
=========================================
- Coverage   45.14%   44.85%   -0.3%     
=========================================
  Files         111      113      +2     
  Lines       12024    12102     +78     
=========================================
  Hits         5428     5428             
- Misses       5761     5839     +78     
  Partials      835      835
Flag Coverage Δ
#linux 48.9% <0%> (-0.33%) ⬇️
#windows 40.1% <0%> (-0.31%) ⬇️
Impacted Files Coverage Δ
runtime/v2/shim/publisher.go 0% <0%> (ø)
runtime/v2/shim/shim_windows.go 38.5% <0%> (-0.68%) ⬇️
runtime/v2/shim/dialer.go 0% <0%> (ø)
runtime/v2/shim/shim.go 0% <0%> (ø) ⬆️
runtime/v2/shim/shim_unix.go 0% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5703f41...047348e. Read the comment docs.

@crosbymichael crosbymichael force-pushed the fix-forward branch 2 times, most recently from 25c14ed to 231543d Compare April 10, 2019 19:39
Copy link
Copy Markdown
Member

@Random-Liu Random-Liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread runtime/v2/runc/v2/service.go Outdated
Comment thread runtime/v2/shim/shim.go Outdated
Shims no longer call `os.Exit` but close the context on shutdown so that
events and other resources have hit the `defer`s.

Signed-off-by: Michael Crosby <[email protected]>
Copy link
Copy Markdown
Member

@Random-Liu Random-Liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the other way. I see.

Comment thread runtime/v2/shim/shim.go Outdated
@Random-Liu
Copy link
Copy Markdown
Member

Random-Liu commented Apr 10, 2019

I just hit an issue in shim v1 that I only see TaskDelete event but no TaskExit event.

I think it is just that the shim process is killed before TaskExit event from shim is sent out.

This PR should make it better for shim v2 with graceful shim termination, but we definitely shouldn't continue relying on TaskExit in the cri plugin.

Filed containerd/cri#1132 and will try to address it as soon as possible.

@Random-Liu
Copy link
Copy Markdown
Member

Random-Liu commented Apr 11, 2019

@crosbymichael Found another issue when working on #3199.

The shim cleanup in loadTasks always fail with this PR:

WARN[2019-04-11T00:59:40.488432866-07:00] cleaning up after shim disconnected           id=45db5ac411f63126ef3286fe4d79afdc3f3cf9c8fcb6026def020ef945a5935d namespace=k8s.io
INFO[2019-04-11T00:59:40.488462107-07:00] cleaning up dead shim                        
WARN[2019-04-11T00:59:40.493553189-07:00] failed to clean up after shim disconnected    error="io.containerd.runc.v1: dial unix /run/containerd/containerd.sock.ttrpc: connect: connection refused\n: exit status 1" id=45db5ac411f63126ef3286fe4d79afdc3f3cf9c8fcb6026def020ef945a5935d namespace=k8s.io

The problem is that the task service is started before the event grpc service.

@crosbymichael
Copy link
Copy Markdown
Member Author

@Random-Liu ok, i know what the issue is.

Signed-off-by: Michael Crosby <[email protected]>
}

func (l *remoteEventsPublisher) Publish(ctx context.Context, topic string, event events.Event) error {
client, err := l.dialer.Get()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So event publish will fail if ttrpc event service is not up yet?

#3193 will probably be able to cover that.

@Random-Liu
Copy link
Copy Markdown
Member

Random-Liu commented Apr 11, 2019

Verified that #3204 (comment) is fixed.

DEBU[2019-04-11T10:40:35.507193010-07:00] loading tasks in namespace                    namespace=k8s.io
WARN[2019-04-11T10:40:35.507404450-07:00] cleaning up after shim disconnected           id=63d6638d17014c040f4a979bfd854fb541deb116e785c7e1332c1c7bd47b783e namespace=k8s.io
INFO[2019-04-11T10:40:35.507420039-07:00] cleaning up dead shim                        
DEBU[2019-04-11T10:40:35.611805828-07:00] garbage collected                             d=4.998416ms
WARN[2019-04-11T10:40:35.676519575-07:00] cleanup warnings time="2019-04-11T10:40:35-07:00" level=info msg="starting signal loop" namespace=k8s.io pid=75740 

However, event publish may still fail before event ttrpc server comes up. The chance should be slim, and retry introduced in #3193 should be able to cover most cases.

And going through all these, I'm 100% sure that we won't rely on events in the CRI plugin any more. :P

@dmcgowan
Copy link
Copy Markdown
Member

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants