libcontainer/container_linux: Consider process state (running, zombie, etc.) in runType #1489

wking · 2017-06-14T23:52:20Z

As @avagin pointed out, we're currently failing to distinguish between running (stopped, disk-sleeping, etc.) and dead/zombie processes when determining the container state. This PR rerolls our runType approach to make that distinction.

Now that we need two values from /proc/[pid]/stat, I've rerolled libcontainer/system/proc to use a Stat_t object (like the sys/x/unix Stat_t API). Then we can get both the start-time and process state from the same filesystem read.

Additional details in the commit messages for the curious.

So we can extract more than the start time with a single read. Signed-off-by: W. Trevor King <[email protected]>

wking · 2017-06-15T18:05:09Z

Hooray, Travis is happy :). Can someone more familiar with the checkpoint tooling comment on whether the appropriate post-checkpoint state is Stopped (with a zombie process?).

avagin · 2017-06-20T18:17:24Z

@wking Thank you for the patch. I looked at it without details and it looks good for me. I'm going to review it carefully a bit later.

Can someone more familiar with the checkpoint tooling comment on whether the appropriate post-checkpoint state is Stopped (with a zombie process?).

I am not sure that I understand what do you mean here. Could you elaborate?

wking · 2017-06-20T18:48:52Z

On Tue, Jun 20, 2017 at 11:17:25AM -0700, Andrew Vagin wrote: > Can someone more familiar with the checkpoint tooling comment on > whether the appropriate post-checkpoint state is Stopped (with a > zombie process?). I am not sure that I understand what do you mean here. Could you elaborate?

The change I make to checkpoint_test.go [1] is after a container has been checkpointed (via container.Checkpoint [2]). When the state is checked [3], the container process was a zombie, which, with this PR, means the status was ‘Stopped’ (before this PR, a zombie container process was reported as ‘Running’ [4]). Interestingly, there's an earlier call to container.Checkpoint with different options in that test [5], and in that case the container process is ‘Running’ both before and after this PR [6]. I don't understand container.Checkpoint well enough to know why the call in [2] (but not the call in [5]) seems to be zombi-fying the container process (the difference may be PreDump being set in the earlier call [7]). And since I don't understand why the container process is (at least sometimes) being zombi-fied, I'm not sure how reliable my checkpoint_test.go change [1] is. [1]: https://github.com/opencontainers/runc/pull/1489/files#diff-4dcc5378706c75d50b2b7c827d8546e8L183 [2]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L173 [3]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L178 [4]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L183 [5]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L145 [6]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L155 [7]: https://github.com/opencontainers/runc/blob/035b57895dcd7284a61aacfb9dbfd545a1a425ba/libcontainer/integration/checkpoint_test.go#L141

avagin · 2017-06-20T22:54:16Z

My reproducer for the problem from is still working. You need to stop the init process, I use gdb for that, and then start tests.

[root@fc24 runc]# gdb -p 1
(gdb) 
[1]+  Stopped                 gdb -p 1
[root@fc24 runc]# bats tests/integration/checkpoint.bats 
 ✗ checkpoint and restore
   (in test file tests/integration/checkpoint.bats, line 35)
     `[ "$status" -ne 0 ]' failed
   runc list (status=0):
   ID          PID         STATUS      BUNDLE      CREATED     OWNER
   runc spec (status=0):
   
   runc run -d --console-socket /tmp/console.sock test_busybox (status=0):
   
   runc state test_busybox (status=0):
   {
     "ociVersion": "1.0.0-rc5-dev",
     "id": "test_busybox",
     "pid": 23063,
     "status": "running",
     "bundle": "/tmp/busyboxtest",
     "rootfs": "/tmp/busyboxtest/rootfs",
     "created": "2017-06-20T22:50:35.560498491Z",
     "owner": ""
   }
   runc --criu /usr/sbin/criu checkpoint test_busybox (status=0):
   time="2017-06-21T01:50:35+03:00" level=error msg="container is not destroyed" 
   runc state test_busybox (status=0):
   {
     "ociVersion": "1.0.0-rc5-dev",
     "id": "test_busybox",
     "pid": 23063,
     "status": "running",
     "bundle": "/tmp/busyboxtest",
     "rootfs": "/tmp/busyboxtest/rootfs",
     "created": "2017-06-20T22:50:35.560498491Z",
     "owner": ""
   }
   runc list (status=0):
   ID             PID         STATUS      BUNDLE             CREATED                          OWNER
   test_busybox   23063       running     /tmp/busyboxtest   2017-06-20T22:50:35.560498491Z   root
   runc kill test_busybox KILL (status=0):
   
   Command "eval __runc state 'test_busybox' | grep -q 'stopped'" failed 10 times. Output: 
   runc delete test_busybox (status=1):
   time="2017-06-21T01:50:46+03:00" level=error msg="cannot delete container test_busybox that is not stopped: running\n\n" 
   cannot delete container test_busybox that is not stopped: running

avagin · 2017-06-20T23:01:30Z

Oops, I forgot to compile runc. Actually everything works as expected with this patch.

avagin · 2017-06-20T23:18:22Z

@wking you changes in checkpoint_test.go looks correct.

Interestingly, there's an earlier call to container.Checkpoint with different options in that
test [5], and in that case the container process is ‘Running’ both before and after this PR [6].

container.Checkpoint(preDumpOpts) doesn't stop a container, this call enables memory tracker and dumps memory for processes without stopping them. Then we call container.Checkpoint() in a second time and this call stops processes and dumps them. The big part of memory is already dumped, so we have to dump only changed memory. This optimization allows us to reduce a downtime of a container, when it is migrated from one host to another host: https://criu.org/Iterative_migration

avagin · 2017-06-20T23:18:32Z

LGTM

And convert the various start-time properties from strings to uint64s. This removes all internal consumers of the deprecated GetProcessStartTime function. Signed-off-by: W. Trevor King <[email protected]>

And Stat_t.PID and Stat_t.Name while we're at it. Then use the new .State property in runType to distinguish between running and zombie/dead processes, since kill(2) does not [1]. With this change we no longer claim Running status for zombie/dead processes. I've also removed the kill(2) call from runType. It was originally added in 13841ef (new-api: return the Running state only if the init process is alive, 2014-12-23), but we've been accessing /proc/[pid]/stat since 14e95b2 (Make state detection precise, 2016-07-05, opencontainers#930), and with the /stat access the kill(2) check is redundant. I also don't see much point to the previously-separate doesInitProcessExist, so I've inlined that logic in runType. It would be nice to distinguish between "/proc/[pid]/stat doesn't exist" and errors parsing its contents, but I've skipped that for the moment. The Running -> Stopped change in checkpoint_test.go is because the post-checkpoint process is a zombie, and with this commit zombie processes are Stopped (and no longer Running). [1]: opencontainers#1483 (comment) Signed-off-by: W. Trevor King <[email protected]>

wking · 2017-06-20T23:28:01Z

I just pushed 407e33a → 2bea4c8 fixing a %ds → %d typo. Sorry for the churn :/. @avagin, can you check the update and re-LGTM?

avagin · 2017-06-20T23:36:26Z

LGTM

TomSweeneyRedHat · 2017-06-21T11:16:17Z

libcontainer/container_linux.go

-	exist, err := c.doesInitProcessExist(pid)
-	if !exist || err != nil {
-		return Stopped, err
+	if stat.StartTime != c.initProcessStartTime || stat.State == system.Zombie || stat.State == system.Dead {


Should there be a test with the state set as a Zombie or Dead to make sure that it gets picked off? Apologies if I missed it, and not a hold up for this PR, suggestion for later if missing.

Should there be a test with the state set as a Zombie or Dead to make sure that it gets picked off?

I'm not sure what you mean. Can you give more detail on the case you feel is not covered?

Should there be a test with the state set as a Zombie or Dead to make sure that it gets picked off?

Ah, I think you mean a unit test where we run this on a dead/zombie process. The checkpoint_test.go change is doing that already, although there's too much going on there for it to be a unit test. If you have another way to manufacture a zombie process for testing, I'm happy to add a more focused test.

I was indeed thinking of a unit test, trying to ensure more code coverage. I unfortunately don't have a good way to emulate a zombie/dead process. I was just thinking of faking it out in a test to change the state of the process. I didn't see that happening for Dead/Zombie in the in the checkpoint_tests.go in particular, only that the process wasn't Stopped. Anywho, no need to address now, especially with this PR, just a point to mull.

TomSweeneyRedHat · 2017-06-21T14:51:22Z

LGTM, shoulda done this when I commented earlier today.

crosbymichael · 2017-06-21T17:23:59Z

LGTM

libcontainer/system/proc: Add Stat and Stat_t

439eaa3

So we can extract more than the start time with a single read. Signed-off-by: W. Trevor King <[email protected]>

wking force-pushed the process-status branch 8 times, most recently from 107bc2c to 407e33a Compare June 15, 2017 17:40

wking mentioned this pull request Jun 20, 2017

libcontainer/specconv/spec_linux: Add support for (no)lazytime #1460

Merged

wking added 2 commits June 20, 2017 16:26

libcontainer: Replace GetProcessStartTime with Stat_t.StartTime

75d98b2

And convert the various start-time properties from strings to uint64s. This removes all internal consumers of the deprecated GetProcessStartTime function. Signed-off-by: W. Trevor King <[email protected]>

wking force-pushed the process-status branch from 407e33a to 2bea4c8 Compare June 20, 2017 23:27

TomSweeneyRedHat reviewed Jun 21, 2017

View reviewed changes

crosbymichael merged commit bd65ef6 into opencontainers:master Jun 21, 2017

wking mentioned this pull request Jun 21, 2017

Update criu to 3.1 #1483

Closed

wking deleted the process-status branch August 10, 2017 18:19

thaJeztah mentioned this pull request Sep 30, 2020

libcontainer/system: remove deprecated GetProcessStartTime #2620

Merged

kolyshkin mentioned this pull request Oct 4, 2021

[1.13.1] fix init race projectatomic/runc#56

Merged

libcontainer/container_linux: Consider process state (running, zombie, etc.) in runType #1489

libcontainer/container_linux: Consider process state (running, zombie, etc.) in runType #1489

Uh oh!

Conversation

wking commented Jun 14, 2017

Uh oh!

wking commented Jun 15, 2017

Uh oh!

avagin commented Jun 20, 2017

Uh oh!

wking commented Jun 20, 2017 via email

Uh oh!

avagin commented Jun 20, 2017

Uh oh!

avagin commented Jun 20, 2017

Uh oh!

avagin commented Jun 20, 2017

Uh oh!

avagin commented Jun 20, 2017 • edited by caniszczyk Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wking commented Jun 20, 2017

Uh oh!

avagin commented Jun 20, 2017 • edited by caniszczyk Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TomSweeneyRedHat Jun 21, 2017

Choose a reason for hiding this comment

Uh oh!

wking Jun 21, 2017

Choose a reason for hiding this comment

Uh oh!

wking Jun 21, 2017

Choose a reason for hiding this comment

Uh oh!

TomSweeneyRedHat Jun 21, 2017

Choose a reason for hiding this comment

Uh oh!

TomSweeneyRedHat commented Jun 21, 2017

Uh oh!

crosbymichael commented Jun 21, 2017 • edited by caniszczyk Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

avagin commented Jun 20, 2017 •

edited by caniszczyk

Loading

avagin commented Jun 20, 2017 •

edited by caniszczyk

Loading

crosbymichael commented Jun 21, 2017 •

edited by caniszczyk

Loading