Skip to content

Conversation

@wking
Copy link
Contributor

@wking wking commented Jun 14, 2017

As @avagin pointed out, we're currently failing to distinguish between running (stopped, disk-sleeping, etc.) and dead/zombie processes when determining the container state. This PR rerolls our runType approach to make that distinction.

Now that we need two values from /proc/[pid]/stat, I've rerolled libcontainer/system/proc to use a Stat_t object (like the sys/x/unix Stat_t API). Then we can get both the start-time and process state from the same filesystem read.

Additional details in the commit messages for the curious.

So we can extract more than the start time with a single read.

Signed-off-by: W. Trevor King <[email protected]>
@wking wking force-pushed the process-status branch 8 times, most recently from 107bc2c to 407e33a Compare June 15, 2017 17:40
@wking
Copy link
Contributor Author

wking commented Jun 15, 2017

Hooray, Travis is happy :). Can someone more familiar with the checkpoint tooling comment on whether the appropriate post-checkpoint state is Stopped (with a zombie process?).

@avagin
Copy link
Contributor

avagin commented Jun 20, 2017

@wking Thank you for the patch. I looked at it without details and it looks good for me. I'm going to review it carefully a bit later.

Can someone more familiar with the checkpoint tooling comment on whether the appropriate post-checkpoint state is Stopped (with a zombie process?).

I am not sure that I understand what do you mean here. Could you elaborate?

@wking
Copy link
Contributor Author

wking commented Jun 20, 2017 via email

@avagin
Copy link
Contributor

avagin commented Jun 20, 2017

My reproducer for the problem from is still working. You need to stop the init process, I use gdb for that, and then start tests.

[root@fc24 runc]# gdb -p 1
(gdb) 
[1]+  Stopped                 gdb -p 1
[root@fc24 runc]# bats tests/integration/checkpoint.bats 
 ✗ checkpoint and restore
   (in test file tests/integration/checkpoint.bats, line 35)
     `[ "$status" -ne 0 ]' failed
   runc list (status=0):
   ID          PID         STATUS      BUNDLE      CREATED     OWNER
   runc spec (status=0):
   
   runc run -d --console-socket /tmp/console.sock test_busybox (status=0):
   
   runc state test_busybox (status=0):
   {
     "ociVersion": "1.0.0-rc5-dev",
     "id": "test_busybox",
     "pid": 23063,
     "status": "running",
     "bundle": "/tmp/busyboxtest",
     "rootfs": "/tmp/busyboxtest/rootfs",
     "created": "2017-06-20T22:50:35.560498491Z",
     "owner": ""
   }
   runc --criu /usr/sbin/criu checkpoint test_busybox (status=0):
   time="2017-06-21T01:50:35+03:00" level=error msg="container is not destroyed" 
   runc state test_busybox (status=0):
   {
     "ociVersion": "1.0.0-rc5-dev",
     "id": "test_busybox",
     "pid": 23063,
     "status": "running",
     "bundle": "/tmp/busyboxtest",
     "rootfs": "/tmp/busyboxtest/rootfs",
     "created": "2017-06-20T22:50:35.560498491Z",
     "owner": ""
   }
   runc list (status=0):
   ID             PID         STATUS      BUNDLE             CREATED                          OWNER
   test_busybox   23063       running     /tmp/busyboxtest   2017-06-20T22:50:35.560498491Z   root
   runc kill test_busybox KILL (status=0):
   
   Command "eval __runc state 'test_busybox' | grep -q 'stopped'" failed 10 times. Output: 
   runc delete test_busybox (status=1):
   time="2017-06-21T01:50:46+03:00" level=error msg="cannot delete container test_busybox that is not stopped: running\n\n" 
   cannot delete container test_busybox that is not stopped: running

@avagin
Copy link
Contributor

avagin commented Jun 20, 2017

Oops, I forgot to compile runc. Actually everything works as expected with this patch.

@avagin
Copy link
Contributor

avagin commented Jun 20, 2017

@wking you changes in checkpoint_test.go looks correct.

Interestingly, there's an earlier call to container.Checkpoint with different options in that
test [5], and in that case the container process is ‘Running’ both before and after this PR [6].

container.Checkpoint(preDumpOpts) doesn't stop a container, this call enables memory tracker and dumps memory for processes without stopping them. Then we call container.Checkpoint() in a second time and this call stops processes and dumps them. The big part of memory is already dumped, so we have to dump only changed memory. This optimization allows us to reduce a downtime of a container, when it is migrated from one host to another host: https://criu.org/Iterative_migration

@avagin
Copy link
Contributor

avagin commented Jun 20, 2017

LGTM

Approved with PullApprove

wking added 2 commits June 20, 2017 16:26
And convert the various start-time properties from strings to uint64s.
This removes all internal consumers of the deprecated
GetProcessStartTime function.

Signed-off-by: W. Trevor King <[email protected]>
And Stat_t.PID and Stat_t.Name while we're at it.  Then use the new
.State property in runType to distinguish between running and
zombie/dead processes, since kill(2) does not [1].  With this change
we no longer claim Running status for zombie/dead processes.

I've also removed the kill(2) call from runType.  It was originally
added in 13841ef (new-api: return the Running state only if the init
process is alive, 2014-12-23), but we've been accessing
/proc/[pid]/stat since 14e95b2 (Make state detection precise,
2016-07-05, opencontainers#930), and with the /stat access the kill(2) check is
redundant.

I also don't see much point to the previously-separate
doesInitProcessExist, so I've inlined that logic in runType.

It would be nice to distinguish between "/proc/[pid]/stat doesn't
exist" and errors parsing its contents, but I've skipped that for the
moment.

The Running -> Stopped change in checkpoint_test.go is because the
post-checkpoint process is a zombie, and with this commit zombie
processes are Stopped (and no longer Running).

[1]: opencontainers#1483 (comment)

Signed-off-by: W. Trevor King <[email protected]>
@wking
Copy link
Contributor Author

wking commented Jun 20, 2017

I just pushed 407e33a2bea4c8 fixing a %ds%d typo. Sorry for the churn :/. @avagin, can you check the update and re-LGTM?

@avagin
Copy link
Contributor

avagin commented Jun 20, 2017

LGTM

Approved with PullApprove

exist, err := c.doesInitProcessExist(pid)
if !exist || err != nil {
return Stopped, err
if stat.StartTime != c.initProcessStartTime || stat.State == system.Zombie || stat.State == system.Dead {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a test with the state set as a Zombie or Dead to make sure that it gets picked off? Apologies if I missed it, and not a hold up for this PR, suggestion for later if missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a test with the state set as a Zombie or Dead to make sure that it gets picked off?

I'm not sure what you mean. Can you give more detail on the case you feel is not covered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a test with the state set as a Zombie or Dead to make sure that it gets picked off?

Ah, I think you mean a unit test where we run this on a dead/zombie process. The checkpoint_test.go change is doing that already, although there's too much going on there for it to be a unit test. If you have another way to manufacture a zombie process for testing, I'm happy to add a more focused test.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was indeed thinking of a unit test, trying to ensure more code coverage. I unfortunately don't have a good way to emulate a zombie/dead process. I was just thinking of faking it out in a test to change the state of the process. I didn't see that happening for Dead/Zombie in the in the checkpoint_tests.go in particular, only that the process wasn't Stopped. Anywho, no need to address now, especially with this PR, just a point to mull.

@TomSweeneyRedHat
Copy link

LGTM, shoulda done this when I commented earlier today.

@crosbymichael
Copy link
Member

crosbymichael commented Jun 21, 2017

LGTM

Approved with PullApprove

@crosbymichael crosbymichael merged commit bd65ef6 into opencontainers:master Jun 21, 2017
@wking wking mentioned this pull request Jun 21, 2017
@wking wking deleted the process-status branch August 10, 2017 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants