
runtime: Add an 'event' operation for subscribing to pushes#508

Closed
wking wants to merge 5 commits into opencontainers:master from wking:event-operation

Conversation


@wking wking commented Jun 23, 2016

The current state operation allows callers to poll for the state, but for some workflows polling is inefficient (how frequently do you poll to balance the cost of polling against the timeliness of the response?) and push notifications make more sense.

The runtime's create process is in a unique position to detect these status transitions.

  • As the actor carrying out container creation, it should have a clear idea of when that creation completes (for the created event).

  • It may set up a communication channel with the container process to orchestrate creation, and that channel may be used to report the start event.

  • It knows (or is) the parent of the container process, and POSIX's wait(3), waitpid(3), and waitid(3) only work for child processes. From wait(3):

    Nothing in this volume of POSIX.1-2008 prevents an
    implementation from providing extensions that permit a process
    to get status from a grandchild or any other process, but a
    process that does not use such extensions must be guaranteed to
    see status from only its direct children.

    So the runtime can set up (or be) the parent waiting on the container process, and arrange for the 'stopped' event to be published on container exit (a minimal sketch of this follows below).
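
For concreteness, a minimal C sketch (not part of the proposed spec text) of the waitid-based approach from the last bullet; publish_event and the JSON shape are placeholders, not anything this proposal requires:

  /*
   * Minimal sketch: a runtime that is (or becomes) the parent of the
   * container process blocks in waitid() and publishes a 'stopped'
   * event when the child exits.  publish_event() is a stand-in for
   * whatever transport the runtime actually uses.
   */
  #include <signal.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  static void publish_event(const char *type, pid_t pid, int exit_status)
  {
          /* Placeholder: write to a socket, a state-directory file, etc. */
          printf("{\"type\": \"%s\", \"pid\": %d, \"status\": %d}\n",
                 type, (int)pid, exit_status);
  }

  static int wait_and_publish(pid_t container_pid)
  {
          siginfo_t info;

          /* WEXITED: block until the container process terminates. */
          if (waitid(P_PID, container_pid, &info, WEXITED) != 0)
                  return -1;
          publish_event("stopped", info.si_pid, info.si_status);
          return 0;
  }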

I've tried to phrase the requirements conservatively to allow for runtimes that have to poll their kernel or some such to notice these changes. I see the following runtime-support cases:

  1. The runtime can easily supply a push-based event operation. In this case, exposing that operation to callers facilitates push-based workflows without much cost.

  2. The runtime cannot supply a push-based event operation, and has to emulate it by polling. In this case, the runtime can pick a polling strategy that makes sense to its maintainers, and callers who aren't satisfied with that strategy can roll their own state poller without a big efficiency hit (as the lesser of two evils).

    The requirement is currently worded so weakly that a runtime would be compliant with:

    1. Container process dies at noon.
    2. User calls state at 5pm.
    3. Runtime checks kernel, and sees that the container process is dead.
    4. Runtime publishes stopped event with a 5pm (and some microseconds) timestamp.
    5. Runtime returns state to the user with stopped in status.

    which is a pretty low bar.

  3. The runtime could supply a push-based event operation, but it would be a lot of work. These runtimes can use polling (2) as a quick-and-dirty solution until someone has time to implement a push-based solution (1). The improvement is transparent to users, who can use the same event operation throughout and passively reap the benefits of the implementation improvements.

Without an operation like this, higher levels that need to trigger on these transitions without polling need to exercise a lot of control over the system:

  • They must be the parent of (or on Linux, the nearest PR_SET_CHILD_SUBREAPER ancestor of) the create process, and the create process needs to exit after creation completes, if they want to block on the created event.
  • They must proxy all start and delete requests if they want to block on the started or deleted events.
  • On Linux, they must be the nearest ancestor of the create operation to set PR_SET_CHILD_SUBREAPER if they want to block on the stopped event.

By making event a runtime requirement, we allow for efficient cross-platform push-based workflows while avoiding the need for tight orchestrator gate-keeping. Runtimes that have implementation difficulties have an easy out that allows their callers to benefit from future implementation improvements. And callers that are not satisfied can always fall back to polling state or the proxy/waitid/PR_SET_CHILD_SUBREAPER approaches.
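
As a rough, Linux-only sketch of the proxy/waitid/PR_SET_CHILD_SUBREAPER fallback mentioned above (the funC invocation and container ID are illustrative values, not a required interface):

  /*
   * Rough sketch of the subreaper fallback: the orchestrator marks
   * itself as a child subreaper before running the runtime's create
   * command, so that when the create process exits, the orphaned
   * container process reparents to the orchestrator (rather than to
   * init) and its exit (the 'stopped' transition) is waitable.
   */
  #include <signal.h>
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
          siginfo_t info;

          prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);  /* Linux >= 3.4 */

          if (fork() == 0) {
                  /* 'funC' and 'container-1' are illustrative values. */
                  execlp("funC", "funC", "create", "container-1", (char *)NULL);
                  _exit(127);
          }

          /* Reap the create process and, later, the reparented
           * container process as they exit. */
          while (waitid(P_ALL, 0, &info, WEXITED) == 0)
                  fprintf(stderr, "pid %d exited with status %d\n",
                          (int)info.si_pid, info.si_status);
          return 0;
  }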

The buffering requirement is very generic, and I expect more details to land in the runtime API specification. Requirements around the buffer-request semantics (e.g. “buffer for 15 seconds” or “buffer the last 5 events” or “buffer the created event”) seemed out of place at the level of detail in this specification.

The goal is to allow for:

$ funC create --event-buffer created ID &
$ funC event --event created ID && hook1 && hook2 && funC start ID
$ fg

This supports blocking on the event without racing on “maybe the created event happened before the event operation attached”.

Given the small number of required events, this buffering should not be a large resource concern, and --event-buffer created would only ever require a single event to be buffered per container.

There is still a race on “maybe the container has already been destroyed and a second container has been created with the same ID before the event operation attached”, but that seems much less likely (especially since the caller is free to pick UUIDs).

This builds on #507, so that should be reviewed first.

UPDATE: note that this pub/sub model does not require an additional long-running process. For an example implementation based on shared access to a directory (like runC uses for state) and inotify, see wking/inotify-pubsub.

wking added 5 commits June 3, 2016 13:42
proc(5) describes the following state entries in proc/[pid]/stat [1]
(for modern kernels):

* R Running
* S Sleeping in an interruptible wait
* D Waiting in uninterruptible disk sleep
* Z Zombie
* T Stopped (on a signal)
* t Tracing stop
* X Dead

and ps(1) has a bit more context [2] (for modern kernels):

* D uninterruptible sleep (usually IO)
* R running or runnable (on run queue)
* S interruptible sleep (waiting for an event to complete)
* T stopped by job control signal
* t stopped by debugger during the tracing
* X dead (should never be seen)
* Z defunct ("zombie") process, terminated but not reaped by its
  parent

So I expect "stopped" to mean "process still exists but is paused,
e.g. by SIGSTOP".  And I expect "exited" to mean "process has finished
and is either a zombie or dead".

After this commit, 'git grep -i stop' only turns up poststop-hook
stuff, a reference in principles.md, a "stoppage" in LICENSE, and some
ChangeLog entries.

Also replace "container's process" with "container process" to match
usage in the rest of the repository.  After this commit:

  $ git grep -i "container process" | wc -l
  16
  $ git grep -i "container's process" | wc -l
  1

Also reword status entries to avoid "running", which is less precise
in our spec (e.g. it also includes "sleeping", "waiting", ...).

Also removes a "them" leftover from a partial plural -> singular
reroll of be59415 (Split create and start, 2016-04-01, opencontainers#384).

[1]: http://man7.org/linux/man-pages/man5/proc.5.html
[2]: http://man7.org/linux/man-pages/man1/ps.1.html

Signed-off-by: W. Trevor King <[email protected]>

To distinguish between "we're still setting this container up" and
"we're finished setting up; you can call 'start' if you like".

Also reference the lifecycle steps, because you can't be too explicit.

Signed-off-by: W. Trevor King <[email protected]>

Because during the 'creating' phase we may not have a container
process yet (e.g. if we're still reading the configuration or setting
up cgroups), and in the 'stopped' phase the PID is no longer
meaningful.

Signed-off-by: W. Trevor King <[email protected]>

The current 'state' operation allows callers to poll for the state,
but for some workflows polling is inefficient (how frequently do you
poll to balance the cost of polling against the timeliness of the
response?) and push notifications make more sense.

The runtime's 'create' process is in a unique position to detect these
status transitions.

* As the actor carrying out container creation, it should have a clear
  idea of when that creation completes (for the 'created' event).

* It may set up a communication channel with the container process to
  orchestrate creation, and that channel may be used to report the
  start event.

* It knows (or is) the parent of the container process, and POSIX's
  wait(3), waitpid(3), and waitid(3) only work for child processes
  [1,2].  From [1]:

    Nothing in this volume of POSIX.1-2008 prevents an implementation
    from providing extensions that permit a process to get status from
    a grandchild or any other process, but a process that does not use
    such extensions must be guaranteed to see status from only its
    direct children.

  So the runtime can set up (or be) the parent waiting on the container
  process, and arrange for the 'stopped' event to be published on
  container exit.

I've tried to phrase the requirements conservatively to allow for
runtimes that have to poll their kernel or some such to notice these
changes.  I see the following runtime-support cases:

a. The runtime can easily supply a push-based event operation.  In
   this case, exposing that operation to callers facilitates push-based
   workflows without much cost.

b. The runtime cannot supply a push-based event operation, and has to
   emulate it by polling.  In this case, the runtime can pick a
   polling strategy that makes sense to its maintainers, and callers
   who aren't satisfied with that strategy can roll their own state
   poller without a big efficiency hit (as the lesser of two evils).

   The requirement is currently worded so weakly that a runtime would
   be compliant with:

   1. Container process dies at noon.
   2. User calls 'state' at 5pm.
   3. Runtime checks kernel, and sees that the container process is
      dead.
   4. Runtime publishes 'stopped' event with a 5pm (and some
      microseconds) timestamp.
   5. Runtime returns state to the user with 'stopped' in 'status'.

   which is a pretty low bar.

c. The runtime could supply a push-based event operation, but it would
   be a lot of work.  These runtimes can use polling (b) as a
   quick-and-dirty solution until someone has time to implement a
   push-based solution (a).  The improvement is transparent to users,
   who can use the same event operation throughout and passively reap
   the benefits of the implementation improvements.

Without an operation like this, higher levels that need to trigger on
these transitions without polling need to exercise a lot of control
over the system:

* They must be the parent of (or on Linux, the nearest
  PR_SET_CHILD_SUBREAPER ancestor of [3]) the create process, and the
  create process needs to exit after creation completes, if they want
  to block on the 'created' event.
* They must proxy all 'start' and 'delete' requests if they want to
  block on the 'started' or 'deleted' events.
* On Linux, they must be the nearest ancestor of the create operation
  to set PR_SET_CHILD_SUBREAPER if they want to block on the 'stopped'
  event.

By making 'event' a runtime requirement, we allow for efficient
cross-platform push-based workflows while avoiding the need for tight
orchestrator gate-keeping.  Runtimes that have implementation
difficulties have an easy out that allows their callers to benefit
from future implementation improvements.  And callers that are not
satisfied can always fall back to polling state or the
proxy/waitid/PR_SET_CHILD_SUBREAPER approaches.

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/wait.html
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/waitid.html
[3]: http://man7.org/linux/man-pages/man2/prctl.2.html

Signed-off-by: W. Trevor King <[email protected]>

This is all very generic, and I expect more details to land in the
runtime API specification.  Requirements around the buffer-request
semantics (e.g. "buffer for 15 seconds" or "buffer the last 5 events"
or "buffer the 'created' event") seemed out of place at the level of
detail in this specification.

The goal is to allow for:

  $ funC create --event-buffer created ID &
  $ funC event --event created ID && hook1 && hook2 && funC start ID
  $ fg

This supports blocking on the event without racing on "maybe the
'created' event happened before the 'event' operation attached".

Given the small number of required events, this buffering should not
be a large resource concern, and '--event-buffer created' would only
ever require a single event to be buffered per container.

There is still a race on "maybe the container has already been
destroyed and a second container has been created with the same ID
before the 'event' operation attached", but that seems much less
likely (especially since the caller is free to pick UUIDs).

Signed-off-by: W. Trevor King <[email protected]>

wking commented Jun 23, 2016

Some discussion with @mikebrow around this starting here.


cyphar commented Jul 12, 2016

I don't like this. For runC, it would mean that we'd have to make a long-running daemon to implement this functionality. IMO that defeats the whole point of runC -- it's a single binary you can use without needing any setup. If we make it so that runC doesn't exit after starting a container, then I'm still iffy about it.


wking commented Jul 12, 2016

On Mon, Jul 11, 2016 at 07:12:09PM -0700, Aleksa Sarai wrote:

For runC, it would mean that we'd have to make a long-running daemon
to implement this functionality.

Why is that? In ccon, where I'm happy with “the start-socket exists”
as the “created” event, I'll implement this operation with a second
long-running process. So:

event-handling-process
`-- ccon host process
    `-- ccon container process

where each parent outlives its child. But runtimes that value process
efficiency but still want a server side for their ‘event’ clients
could do:

event-handling host process
`-- container process

If you don't want extra server-side stuff for runC, you could put
events in a subdirectory of your state directory and have ‘event’
clients use inotify to track additions to that directory.
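
A rough C sketch of that daemonless 'event' client (the /run/funC/container-1/events path and the one-file-per-event layout are assumptions for illustration, not runC's actual state format):

  /*
   * Rough sketch of a daemonless 'event' client: the runtime drops one
   * file per event into <state-dir>/<id>/events/ and the client
   * watches that directory with inotify.  The path and layout are
   * assumptions, not runC's actual state format.
   */
  #include <limits.h>
  #include <stdio.h>
  #include <sys/inotify.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          const char *dir = argc > 1 ? argv[1] : "/run/funC/container-1/events";
          char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
                  __attribute__((aligned(8)));
          int fd = inotify_init1(0);

          /* New event files show up as IN_CREATE (or IN_MOVED_TO if the
           * runtime writes to a temporary name and renames for atomicity). */
          inotify_add_watch(fd, dir, IN_CREATE | IN_MOVED_TO);

          for (;;) {
                  ssize_t len = read(fd, buf, sizeof(buf));
                  if (len <= 0)
                          break;
                  /* For brevity, handle one event per read(); a real client
                   * would walk the whole buffer. */
                  struct inotify_event *ev = (struct inotify_event *)buf;
                  if (ev->len)
                          printf("new event file: %s/%s\n", dir, ev->name);
          }
          return 0;
  }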

And if you want to invest the minimum dev time, you can always fall
back to polling ‘state’ calls (approach 3 in my runtime-support cases
1).

If we make it so that runC doesn't exit after starting a container,
then I'm still iffy about it.

I agree that a system without this ‘event’ operation is lighter
weight. But I think supporting ‘event’ is a lesser evil than the
“Without an operation like this…” conditions I list in 1. Do you
have alternative, portable solutions to those problems?


mikebrow commented Jul 12, 2016

I'm good with the idea of an event model.. but agree we should keep tools like runc lower in the stack. For example, runC is not an orchestrator, it's a low-level component. Thus I think the idea has been to keep this runtime spec at said lower level.

Just thinking out loud.. can't we map the event notifications to Mrunal's hooks and store event subscriptions as annotations?

IOW maybe the spec discussion should avoid describing how the implementation of the event model should be provided.


wking commented Jul 12, 2016

On Tue, Jul 12, 2016 at 07:05:15AM -0700, Mike Brown wrote:

… can't we map the event notifications to Mrunal's hooks and store
event registrations as annotations?

Can you unpack “event registrations”? Do you just mean “the event
JSON”, e.g.:

"annotations": {
"opencontainers.org/events": "[{"type": "created", "id": "oci-container1", "timestamp": "2016-07-12T09:33:10-0700"}, …]"
}

I like the state subdir approach better 1, since you don't have to
worry about parallel writes to a single state file, but runtimes
should be free to implement the operation however they like.

IOW maybe the spec should avoid describing how the implementation of
the event model should be provided.

I'm not sure where you're going with “should be provided”. I think
the implementation should be up to the runtime, and this PR doesn't go
into detail about the exposed API (since none of the operations in
runtime.md do).

But if this PR lands, I think the API spec (#511 / #513) should
specify a required command-line API along the lines of the funC
example in 2. Otherwise the compliance test suite and other generic
runtime-callers won't know how to invoke the operation.

@mikebrow

/s/event registrations/event subscriptions/

@crosbymichael

+1 to what @cyphar said on why we shouldn't do this

wking added a commit to wking/opencontainer-runtime-spec that referenced this pull request Jul 15, 2016
Catch up with be59415 (Split create and start, 2016-04-01, opencontainers#384).

I'm not a big fan of this early-exit 'create', which requires
platform-specific magic to collect the container process's exit code.

I have a proposal for an 'event' operation [1] which would provide a
convenient created trigger.  With 'event' in place, we don't need the
'create' process exit to serve as that trigger, and could have a
long-running 'create' that collects the container process exit code
using the portable waitid() family.  But the consensus after this
week's meeting was to table that while we land docs for the runC API
[2].

The "SHOULD exit as quickly as possible after ..." wording is going to
be hard to enforce (which is why it's not MUST), but with the runC
model, clients rely on the command exits to trigger post-create and
post-start activity.  The longer the runtime hangs around after
completing its action, the laggier those triggers will be.

The 'create' ID argument is required (while the old 'start' ID
argument was optional) because the runC approach requires an explicit
'delete' call to clean up the container.  With a long-running 'create'
that could automatically cleanup, you could have an optional ID to
support users that don't care what ID is used (because they know they
won't be calling 'state', 'start', etc.).

One benefit of the early-exit 'create' is that the exit code does not
conflate container process exits with "failed to set up the sandbox"
exits.  We can take advantage of that and use non-zero 'create' exits
to allow stderr writing (so the runtime can log errors while dying
without having to successfully connect to syslog or some such).

The "MUST NOT attempt to read from its stdin" means a generic caller
can safely exec the command with a closed or null stdin, and not have
to worry about the command blocking or crashing because of that.  The
stdout spec for start/delete is more lenient, because runtimes are
unlikely to change their behavior because they are unable to write to
stdout.  If this assumption proves troublesome, we may have to tighten
it up later.

The ptrace idea is from Mrunal [3].

[1]: opencontainers#508
[2]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-07-13-17.03.log.html#l-15
[3]: http://ircbot.wl.linuxfoundation.org/eavesdrop/%23opencontainers/%23opencontainers.2016-07-13.log.html#t2016-07-13T18:58:54

wking commented Jul 16, 2016

On Tue, Jul 12, 2016 at 01:02:21PM -0700, Michael Crosby wrote:

+1 to what @cyphar said on why we shouldn't do this

Responding to @cyphar, I pushed back on this requiring a long-running
daemon, waving my hands in a number of directions and mentioning
inotify briefly 1 ;). To make that daemonless inotify reference a
bit more concrete, I've mocked it up in shell 2. Following
@mikebrow's advice 3, I've added an UPDATE to my topic post to link
to that project too.

Does that address your concerns about long-running processes? If not,
can you be specific about why?

wking added a commit to wking/opencontainer-runtime-spec that referenced this pull request Jul 20, 2016
Catch up with be59415 (Split create and start, 2016-04-01, opencontainers#384).

I'm not a big fan of this early-exit 'create', which requires
platform-specific magic to collect the container process's exit code.

I have a proposal for an 'event' operation [1] which would provide a
convenient created trigger.  With 'event' in place, we don't need the
'create' process exit to serve as that trigger, and could have a
long-running 'create' that collects the container process exit code
using the portable waitid() family.  But the consensus after this
week's meeting was to table that while we land docs for the runC API
[2].

The "SHOULD exit as quickly as possible after ..." wording is going to
be hard to enforce (which is why it's not MUST), but with the runC
model, clients rely on the command exits to trigger post-create and
post-start activity.  The longer the runtime hangs around after
completing its action, the laggier those triggers will be.

The 'create' ID argument is required (while the old 'start' ID
argument was optional) because the runC approach requires an explicit
'delete' call to clean up the container.  With a long-running 'create'
that could automatically cleanup, you could have an optional ID to
support users that don't care what ID is used (because they know they
won't be calling 'state', 'start', etc.).

One benefit of the early-exit 'create' is that the exit code does not
conflate container process exits with "failed to set up the sandbox"
exits.  We can take advantage of that and use non-zero 'create' exits
to allow stderr writing (so the runtime can log errors while dying
without having to successfully connect to syslog or some such).

The "MUST NOT attempt to read from its stdin" means a generic caller
can safely exec the command with a closed or null stdin, and not have
to worry about the command blocking or crashing because of that.  The
stdout spec for start/delete is more lenient, because runtimes are
unlikely to change their behavior because they are unable to write to
stdout.  If this assumption proves troublesome, we may have to tighten
it up later.

The ptrace idea is from Mrunal [3].

[1]: opencontainers#508
[2]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-07-13-17.03.log.html#l-15
[3]: http://ircbot.wl.linuxfoundation.org/eavesdrop/%23opencontainers/%23opencontainers.2016-07-13.log.html#t2016-07-13T18:58:54

Signed-off-by: W. Trevor King <[email protected]>
wking added a commit to wking/ccon that referenced this pull request Jul 22, 2016
Along the lines of the create/start split we've been discussing for
the OCI [1,2].  Pull some common functionality into a libccon,
although I don't intend to make that more than an internal helper.

Post-error cleanup in ccon-cli is pretty slim, since the process is
just about to die anyway.  I'll probably go back through and add
proper cleanup later.

serve_socket gets too deeply nested for my taste, but it's not quite
to the point where I'd pull out the request handling into a
handle_start_request helper or some such.  Mostly because I'd have to
either set up a new buf/iov/msg in the helper's stack or pass that all
through from serve_socket ;).

A few notes on the tests:

I don't have a stand-alone 'wait' on my system (it's built into most
shells [3]), but I've added an explicit check for it because POSIX
doesn't require it to be built in.

The waiting in the create/start tests is a bit awkward, but here's the
reasoning for the various steps:

* Put the socket in a 'sock' subdirectory so we don't have to mess
  with inotifywait's --exclude (when what we really want is an
  --include).  In most cases, the socket would be the first thing
  created in the test directory, but the process.host test will create
  a pivot-root.* before creating the socket.

* Clear a 'wait' file before launching the inotifywait/start subshell
  and append to it from that subshell so grep has something to look at
  (even if it racily starts looking before the subshell processes the
  inotifywait line).

* Block on a busy-wait grep until inotifywait sets up its watcher.
  This ensures we don't call ccon and create the socket before the
  watcher is looking.

  The zero-second sleep while we wait for inotifywait to set up its
  watcher is busy, but POSIX doesn't require 'sleep' to support
  non-integer times [4].

All of these issues could be abstracted out into an 'event' command
[5], but they're fine for a proof-of-concept create/start split.

[1]: https://groups.google.com/a/opencontainers.org/d/msg/dev/qWHoKs8Fsrk/k55FQrBzBgAJ
     Subject: Re: Splitting Start into create() and run()
     Date: Thu, 24 Mar 2016 15:04:14 -0700
     Message-ID: <[email protected]>
[2]: opencontainers/runtime-spec#384
     Subject: Split create and start
[3]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap01.html#tag_17_06
[4]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sleep.html
[5]: opencontainers/runtime-spec#508
     Subject: runtime: Add an 'event' operation for subscribing to
       pushes

wking commented Jul 23, 2016

On Sat, Jul 16, 2016 at 04:03:17PM -0700, W. Trevor King wrote:

To make that daemonless inotify reference a bit more concrete, I've
mocked it up in shell [2].

[2]: https://github.com/wking/inotify-pubsub

And to make the OCI integration more concrete too, I have a ccon
branch where ccon-oci implements an event command along these lines
1. You can do (with process.terminal true):

$ echo 'echo hi; exit' | ccon-oci create --id container-1 >output &
$ ccon-oci event -He created container-1 && ccon-oci start container-1
{"type": "created", "timestamp": "2016-07-23T04:16:24.503950Z", "id": "container-1"}
$ cat /tmp/output
hi

If you use a second terminal for the event/start call, you can skip
the standard stream redirection:

term1 $ ccon-oci create --id container-1
…blocks after creation…

term2 $ ccon-oci event -He created container-1 && ccon-oci start container-1
{"timestamp": "2016-07-23T04:21:20.675097Z", "id": "container-1", "type": "created"}

…create unblocks on term1…
/root # echo hi
hi
/root # exit 17
term1 $ echo $?
17

That seems pretty clean to me, and doesn't require much implementation
magic to distribute events or collect the container-process's exit
code.

wking pushed a commit to wking/opencontainer-runtime-spec that referenced this pull request Jul 25, 2016