
runtime: Add an 'event' operation for subscribing to pushes#508

Closed
wking wants to merge 5 commits into opencontainers:master from wking:event-operation

Conversation


@wking wking commented Jun 23, 2016

The current state operation allows callers to poll for the state, but for some workflows polling is inefficient (how frequently do you poll to balance the cost of polling against the timeliness of the response?) and push notifications make more sense.

The runtime's create process is in a unique position to detect these status transitions.

  • As the actor carrying out container creation, it should have a clear idea of when that creation completes (for the created event).

  • It may set up a communication channel with the container process to orchestrate creation, and that channel may be used to report the start event.

  • It knows (or is) the parent of the container process, and POSIX's wait(3), waitpid(3), and waitid(3) only work for child processes. From wait(3):

    Nothing in this volume of POSIX.1-2008 prevents an
    implementation from providing extensions that permit a process
    to get status from a grandchild or any other process, but a
    process that does not use such extensions must be guaranteed to
    see status from only its direct children.

    So the runtime can set up (or be) the parent waiting on the container process, and arrange for the 'stopped' event to be published on container exit (a minimal sketch of this follows below).
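
For concreteness, a minimal C sketch (not part of the proposed spec text) of the waitid-based approach from the last bullet; publish_event and the JSON shape are placeholders, not anything this proposal requires:

  /*
   * Minimal sketch: a runtime that is (or becomes) the parent of the
   * container process blocks in waitid() and publishes a 'stopped'
   * event when the child exits.  publish_event() is a stand-in for
   * whatever transport the runtime actually uses.
   */
  #include <signal.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  static void publish_event(const char *type, pid_t pid, int exit_status)
  {
          /* Placeholder: write to a socket, a state-directory file, etc. */
          printf("{\"type\": \"%s\", \"pid\": %d, \"status\": %d}\n",
                 type, (int)pid, exit_status);
  }

  static int wait_and_publish(pid_t container_pid)
  {
          siginfo_t info;

          /* WEXITED: block until the container process terminates. */
          if (waitid(P_PID, container_pid, &info, WEXITED) != 0)
                  return -1;
          publish_event("stopped", info.si_pid, info.si_status);
          return 0;
  }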

I've tried to phrase the requirements conservatively to allow for runtimes that have to poll their kernel or some such to notice these changes. I see the following runtime-support cases:

  1. The runtime can easily supply a push-based event operation. In this case, exposing that operation to callers facilitates push-based workflows without much cost.

  2. The runtime cannot supply a push-based event operation, and has to emulate it by polling. In this case, the runtime can pick a polling strategy that makes sense to its maintainers, and callers who aren't satisfied with that strategy can roll their own state poller without a big efficiency hit (as the lesser of two evils).

    The requirement is currently worded so weakly that a runtime would be compliant with:

    1. Container process dies at noon.
    2. User calls state at 5pm.
    3. Runtime checks kernel, and sees that the container process is dead.
    4. Runtime publishes stopped event with a 5pm (and some microseconds) timestamp.
    5. Runtime returns state to the user with stopped in status.

    which is a pretty low bar.

  3. The runtime could supply a push-based event operation, but it would be a lot of work. These runtimes can use polling (2) as a quick-and-dirty solution until someone has time to implement a push-based solution (1). The improvement is transparent to users, who can use the same event operation throughout and passively reap the benefits of the implementation improvements.

Without an operation like this, higher levels that need to trigger on these transitions without polling need to exercise a lot of control over the system:

  • They must be the parent of (or on Linux, the nearest PR_SET_CHILD_SUBREAPER ancestor of) the create process, and the create process needs to exit after creation completes, if they want to block on the created event.
  • They must proxy all start and delete requests if they want to block on the started or deleted events.
  • On Linux, they must be the nearest ancestor of the create operation to set PR_SET_CHILD_SUBREAPER if they want to block on the stopped event.

By making event a runtime requirement, we allow for efficient cross-platform push-based workflows while avoiding the need for tight orchestrator gate-keeping. Runtimes that have implementation difficulties have an easy out that allows their callers to benefit from future implementation improvements. And callers that are not satisfied can always fall back to polling state or the proxy/waitid/PR_SET_CHILD_SUBREAPER approaches.
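
As a rough, Linux-only sketch of the proxy/waitid/PR_SET_CHILD_SUBREAPER fallback mentioned above (the funC invocation and container ID are illustrative values, not a required interface):

  /*
   * Rough sketch of the subreaper fallback: the orchestrator marks
   * itself as a child subreaper before running the runtime's create
   * command, so that when the create process exits, the orphaned
   * container process reparents to the orchestrator (rather than to
   * init) and its exit (the 'stopped' transition) is waitable.
   */
  #include <signal.h>
  #include <stdio.h>
  #include <sys/prctl.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void)
  {
          siginfo_t info;

          prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);  /* Linux >= 3.4 */

          if (fork() == 0) {
                  /* 'funC' and 'container-1' are illustrative values. */
                  execlp("funC", "funC", "create", "container-1", (char *)NULL);
                  _exit(127);
          }

          /* Reap the create process and, later, the reparented
           * container process as they exit. */
          while (waitid(P_ALL, 0, &info, WEXITED) == 0)
                  fprintf(stderr, "pid %d exited with status %d\n",
                          (int)info.si_pid, info.si_status);
          return 0;
  }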

The buffering requirement is very generic, and I expect more details to land in the runtime API specification. Requirements around the buffer-request semantics (e.g. “buffer for 15 seconds” or “buffer the last 5 events” or “buffer the created event”) seemed out of place at the level of detail in this specification.

The goal is to allow for:

$ funC create --event-buffer created ID &
$ funC event --event created ID && hook1 && hook2 && funC start ID
$ fg

This supports blocking on the event without racing on “maybe the created event happened before the event operation attached”.

Given the small number of required events, this buffering should not be a large resource concern, and --event-buffer created would only ever require a single event to be buffered per container.

There is still a race on “maybe the container has already been destroyed and a second container has been created with the same ID before the event operation attached”, but that seems much less likely (especially since the caller is free to pick UUIDs).

This builds on #507, so that should be reviewed first.

UPDATE: note that this pub/sub model does not require an additional long-running process. For an example implementation based on shared access to a directory (like runC uses for state) and inotify, see wking/inotify-pubsub.

wking added 5 commits June 3, 2016 13:42
proc(5) describes the following state entries in proc/[pid]/stat [1]
(for modern kernels):

* R Running
* S Sleeping in an interruptible wait
* D Waiting in uninterruptible disk sleep
* Z Zombie
* T Stopped (on a signal)
* t Tracing stop
* X Dead

and ps(1) has a bit more context [2] (for modern kernels):

* D uninterruptible sleep (usually IO)
* R running or runnable (on run queue)
* S interruptible sleep (waiting for an event to complete)
* T stopped by job control signal
* t stopped by debugger during the tracing
* X dead (should never be seen)
* Z defunct ("zombie") process, terminated but not reaped by its
  parent

So I expect "stopped" to mean "process still exists but is paused,
e.g. by SIGSTOP".  And I expect "exited" to mean "process has finished
and is either a zombie or dead".

After this commit, 'git grep -i stop' only turns up poststop-hook
stuff, a reference in principles.md, a "stoppage" in LICENSE, and some
ChangeLog entries.

Also replace "container's process" with "container process" to match
usage in the rest of the repository.  After this commit:

  $ git grep -i "container process" | wc -l
  16
  $ git grep -i "container's process" | wc -l
  1

Also reword status entries to avoid "running", which is less precise
in our spec (e.g. it also includes "sleeping", "waiting", ...).

Also removes a "them" leftover from a partial plural -> singular
reroll of be59415 (Split create and start, 2016-04-01, opencontainers#384).

[1]: http://man7.org/linux/man-pages/man5/proc.5.html
[2]: http://man7.org/linux/man-pages/man1/ps.1.html

Signed-off-by: W. Trevor King <[email protected]>

To distinguish between "we're still setting this container up" and
"we're finished setting up; you can call 'start' if you like".

Also reference the lifecycle steps, because you can't be too explicit.

Signed-off-by: W. Trevor King <[email protected]>

Because during the 'creating' phase we may not have a container
process yet (e.g. if we're still reading the configuration or setting
up cgroups), and in the 'stopped' phase the PID is no longer
meaningful.

Signed-off-by: W. Trevor King <[email protected]>

The current 'state' operation allows callers to poll for the state,
but for some workflows polling is inefficient (how frequently do you
poll to balance the cost of polling against the timeliness of the
response?) and push notifications make more sense.

The runtime's 'create' process is in a unique position to detect these
status transitions.

* As the actor carrying out container creation, it should have a clear
  idea of when that creation completes (for the 'created' event).

* It may set up a communication channel with the container process to
  orchestrate creation, and that channel may be used to report the
  start event.

* It knows (or is) the parent of the container process, and POSIX's
  wait(3), waitpid(3), and waitid(3) only work for child processes
  [1,2].  From [1]:

    Nothing in this volume of POSIX.1-2008 prevents an implementation
    from providing extensions that permit a process to get status from
    a grandchild or any other process, but a process that does not use
    such extensions must be guaranteed to see status from only its
    direct children.

  So the runtime can set up (or be) the parent waiting on the container
  process, and arrange for the 'stopped' event to be published on
  container exit.

I've tried to phrase the requirements conservatively to allow for
runtimes that have to poll their kernel or some such to notice these
changes.  I see the following runtime-support cases:

a. The runtime can easily supply a push-based event operation.  In
   this case, exposing that operation to callers facilitates push-based
   workflows without much cost.

b. The runtime cannot supply a push-based event operation, and has to
   emulate it by polling.  In this case, the runtime can pick a
   polling strategy that makes sense to its maintainers, and callers
   who aren't satisfied with that strategy can roll their own state
   poller without a big efficiency hit (as the lesser of two evils).

   The requirement is currently worded so weakly that a runtime would
   be compliant with:

   1. Container process dies at noon.
   2. User calls 'state' at 5pm.
   3. Runtime checks kernel, and sees that the container process is
      dead.
   4. Runtime publishes 'stopped' event with a 5pm (and some
      microseconds) timestamp.
   5. Runtime returns state to the user with 'stopped' in 'status'.

   which is a pretty low bar.

c. The runtime could supply a push-based event operation, but it would
   be a lot of work.  These runtimes can use polling (b) as a
   quick-and-dirty solution until someone has time to implement a
   push-based solution (a).  The improvement is transparent to users,
   who can use the same event operation throughout and passively reap
   the benefits of the implementation improvements.

Without an operation like this, higher levels that need to trigger on
these transitions without polling need to exercise a lot of control
over the system:

* They must be the parent of (or on Linux, the nearest
  PR_SET_CHILD_SUBREAPER ancestor of [3]) the create process, and the
  create process needs to exit after creation completes, if they want
  to block on the 'created' event.
* They must proxy all 'start' and 'delete' requests if they want to
  block on the 'started' or 'deleted' events.
* On Linux, they must be the nearest ancestor of the create operation
  to set PR_SET_CHILD_SUBREAPER if they want to block on the 'stopped'
  event.

By making 'event' a runtime requirement, we allow for efficient
cross-platform push-based workflows while avoiding the need for tight
orchestrator gate-keeping.  Runtimes that have implementation
difficulties have an easy out that allows their callers to benefit
from future implementation improvements.  And callers that are not
satisfied can always fall back to polling state or the
proxy/waitid/PR_SET_CHILD_SUBREAPER approaches.

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/wait.html
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/waitid.html
[3]: http://man7.org/linux/man-pages/man2/prctl.2.html

Signed-off-by: W. Trevor King <[email protected]>

This is all very generic, and I expect more details to land in the
runtime API specification.  Requirements around the buffer-request
semantics (e.g. "buffer for 15 seconds" or "buffer the last 5 events"
or "buffer the 'created' event") seemed out of place at the level of
detail in this specification.

The goal is to allow for:

  $ funC create --event-buffer created ID &
  $ funC event --event created ID && hook1 && hook2 && funC start ID
  $ fg

This supports blocking on the event without racing on "maybe the
'created' event happened before the 'event' operation attached".

Given the small number of required events, this buffering should not
be a large resource concern, and '--event-buffer created' would only
ever require a single event to be buffered per container.

There is still a race on "maybe the container has already been
destroyed and a second container has been created with the same ID
before the 'event' operation attached", but that seems much less
likely (especially since the caller is free to pick UUIDs).

Signed-off-by: W. Trevor King <[email protected]>

wking commented Jun 23, 2016

Some discussion with @mikebrow around this starting here.


cyphar commented Jul 12, 2016

I don't like this. For runC, it would mean that we'd have to make a long-running daemon to implement this functionality. IMO that defeats the whole point of runC -- it's a single binary you can use without needing any setup. If we make it so that runC doesn't exit after starting a container, then I'm still iffy about it.


wking commented Jul 12, 2016

On Mon, Jul 11, 2016 at 07:12:09PM -0700, Aleksa Sarai wrote:

For runC, it would mean that we'd have to make a long-running daemon
to implement this functionality.

Why is that? In ccon, where I'm happy with “the start-socket exists”
as the “created” event, I'll implement this operation with a second
long-running process. So:

event-handling-process
`-- ccon host process
    `-- ccon container process

where each parent outlives its child. But runtimes that value process
efficiency but still want a server side for their ‘event’ clients
could do:

event-handling host process
`-- container process

If you don't want extra server-side stuff for runC, you could put
events in a subdirectory of your state directory and have ‘event’
clients use inotify to track additions to that directory.
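
A rough C sketch of that daemonless 'event' client (the /run/funC/container-1/events path and the one-file-per-event layout are assumptions for illustration, not runC's actual state format):

  /*
   * Rough sketch of a daemonless 'event' client: the runtime drops one
   * file per event into <state-dir>/<id>/events/ and the client
   * watches that directory with inotify.  The path and layout are
   * assumptions, not runC's actual state format.
   */
  #include <limits.h>
  #include <stdio.h>
  #include <sys/inotify.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          const char *dir = argc > 1 ? argv[1] : "/run/funC/container-1/events";
          char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
                  __attribute__((aligned(8)));
          int fd = inotify_init1(0);

          /* New event files show up as IN_CREATE (or IN_MOVED_TO if the
           * runtime writes to a temporary name and renames for atomicity). */
          inotify_add_watch(fd, dir, IN_CREATE | IN_MOVED_TO);

          for (;;) {
                  ssize_t len = read(fd, buf, sizeof(buf));
                  if (len <= 0)
                          break;
                  /* For brevity, handle one event per read(); a real client
                   * would walk the whole buffer. */
                  struct inotify_event *ev = (struct inotify_event *)buf;
                  if (ev->len)
                          printf("new event file: %s/%s\n", dir, ev->name);
          }
          return 0;
  }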

And if you want to invest the minimum dev time, you can always fall
back to polling ‘state’ calls (approach 3 in my runtime-support cases
1).

If we make it so that runC doesn't exit after starting a container,
then I'm still iffy about it.

I agree that a system without this ‘event’ operation is lighter
weight. But I think supporting ‘event’ is a lesser evil than the
“Without an operation like this…” conditions I list in 1. Do you
have alternative, portable solutions to those problems?


mikebrow commented Jul 12, 2016

I'm good with the idea of an event model.. but agree we should keep tools like runc lower in the stack. For example, runC is not an orchestrator, it's a low-level component. Thus I think the idea has been to keep this runtime spec at said lower level.

Just thinking out loud.. can't we map the event notifications to Mrunal's hooks and store event subscriptions as annotations?

IOW maybe the spec discussion should avoid describing how the implementation of the event model should be provided.


wking commented Jul 12, 2016

On Tue, Jul 12, 2016 at 07:05:15AM -0700, Mike Brown wrote:

… can't we map the event notifications to Mrunal's hooks and store
event registrations as annotations?

Can you unpack “event registrations”? Do you just mean “the event
JSON”, e.g.:

"annotations": {
"opencontainers.org/events": "[{"type": "created", "id": "oci-container1", "timestamp": "2016-07-12T09:33:10-0700"}, …]"
}

I like the state subdir approach better 1, since you don't have to
worry about parallel writes to a single state file, but runtimes
should be free to implement the operation however they like.

IOW maybe the spec should avoid describing how the implementation of
the event model should be provided.

I'm not sure where you're going with “should be provided”. I think
the implementation should be up to the runtime, and this PR doesn't go
into detail about the exposed API (since none of the operations in
runtime.md do).

But if this PR lands, I think the API spec (#511 / #513) should
specify a required command-line API along the lines of the funC
example in 2. Otherwise the compliance test suite and other generic
runtime-callers won't know how to invoke the operation.

@mikebrow

/s/event registrations/event subscriptions/

@crosbymichael

+1 to what @cyphar said on why we shouldn't do this

wking added a commit to wking/opencontainer-runtime-spec that referenced this pull request Jul 15, 2016
Catch up with be59415 (Split create and start, 2016-04-01, opencontainers#384).

I'm not a big fan of this early-exit 'create', which requires
platform-specific magic to collect the container process's exit code.

I have a proposal for an 'event' operation [1] which would provide a
convenient created trigger.  With 'event' in place, we don't need the
'create' process exit to serve as that trigger, and could have a
long-running 'create' that collects the container process exit code
using the portable waitid() family.  But the consensus after this
week's meeting was to table that while we land docs for the runC API
[2].

The "SHOULD exit as quickly as possible after ..." wording is going to
be hard to enforce (which is why it's not MUST), but with the runC
model, clients rely on the command exits to trigger post-create and
post-start activity.  The longer the runtime hangs around after
completing its action, the laggier those triggers will be.

The 'create' ID argument is required (while the old 'start' ID
argument was optional) because the runC approach requires an explicit
'delete' call to clean up the container.  With a long-running 'create'
that could automatically cleanup, you could have an optional ID to
support users that don't care what ID is used (because they know they
won't be calling 'state', 'start', etc.).

One benefit of the early-exit 'create' is that the exit code does not
conflate container process exits with "failed to set up the sandbox"
exits.  We can take advantage of that and use non-zero 'create' exits
to allow stderr writing (so the runtime can log errors while dying
without having to successfully connect to syslog or some such).

The "MUST NOT attempt to read from its stdin" means a generic caller
can safely exec the command with a closed or null stdin, and not have
to worry about the command blocking or crashing because of that.  The
stdout spec for start/delete is more lenient, because runtimes are
unlikely to change their behavior because they are unable to write to
stdout.  If this assumption proves troublesome, we may have to tighten
it up later.

The ptrace idea is from Mrunal [3].

[1]: opencontainers#508
[2]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-07-13-17.03.log.html#l-15
[3]: http://ircbot.wl.linuxfoundation.org/eavesdrop/%23opencontainers/%23opencontainers.2016-07-13.log.html#t2016-07-13T18:58:54

wking commented Jul 16, 2016

On Tue, Jul 12, 2016 at 01:02:21PM -0700, Michael Crosby wrote:

+1 to what @cyphar said on why we shouldn't do this

Responding to @cyphar, I pushed back on this requiring a long-running
daemon, waving my hands in a number of directions and mentioning
inotify briefly 1 ;). To make that daemonless inotify reference a
bit more concrete, I've mocked it up in shell 2. Following
@mikebrow's advice 3, I've added an UPDATE to my topic post to link
to that project too.

Does that address your concerns about long-running processes? If not,
can you be specific about why?

wking added a commit to wking/opencontainer-runtime-spec that referenced this pull request Jul 20, 2016
Catch up with be59415 (Split create and start, 2016-04-01, opencontainers#384).

I'm not a big fan of this early-exit 'create', which requires
platform-specific magic to collect the container process's exit code.

I have a proposal for an 'event' operation [1] which would provide a
convenient created trigger.  With 'event' in place, we don't need the
'create' process exit to serve as that trigger, and could have a
long-running 'create' that collects the container process exit code
using the portable waitid() family.  But the consensus after this
week's meeting was to table that while we land docs for the runC API
[2].

The "SHOULD exit as quickly as possible after ..." wording is going to
be hard to enforce (which is why it's not MUST), but with the runC
model, clients rely on the command exits to trigger post-create and
post-start activity.  The longer the runtime hangs around after
completing its action, the laggier those triggers will be.

The 'create' ID argument is required (while the old 'start' ID
argument was optional) because the runC approach requires an explicit
'delete' call to clean up the container.  With a long-running 'create'
that could automatically cleanup, you could have an optional ID to
support users that don't care what ID is used (because they know they
won't be calling 'state', 'start', etc.).

One benefit of the early-exit 'create' is that the exit code does not
conflate container process exits with "failed to set up the sandbox"
exits.  We can take advantage of that and use non-zero 'create' exits
to allow stderr writing (so the runtime can log errors while dying
without having to successfully connect to syslog or some such).

The "MUST NOT attempt to read from its stdin" means a generic caller
can safely exec the command with a closed or null stdin, and not have
to worry about the command blocking or crashing because of that.  The
stdout spec for start/delete is more lenient, because runtimes are
unlikely to change their behavior because they are unable to write to
stdout.  If this assumption proves troublesome, we may have to tighten
it up later.

The ptrace idea is from Mrunal [3].

[1]: opencontainers#508
[2]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-07-13-17.03.log.html#l-15
[3]: http://ircbot.wl.linuxfoundation.org/eavesdrop/%23opencontainers/%23opencontainers.2016-07-13.log.html#t2016-07-13T18:58:54

Signed-off-by: W. Trevor King <[email protected]>
wking added a commit to wking/ccon that referenced this pull request Jul 22, 2016
Along the lines of the create/start split we've been discussing for
the OCI [1,2].  Pull some common functionality into a libccon,
although I don't intend to make that more than an internal helper.

Post-error cleanup in ccon-cli is pretty slim, since the process is
just about to die anyway.  I'll probably go back through and add
proper cleanup later.

serve_socket gets too deeply nested for my taste, but it's not quite
to the point where I'd pull out the request handling into a
handle_start_request helper or some such.  Mostly because I'd have to
either set up a new buf/iov/msg in the helper's stack or pass that all
through from serve_socket ;).

A few notes on the tests:

I don't have a stand-alone 'wait' on my system (it's built into most
shells [3]), but I've added an explicit check for it because POSIX
doesn't require it to be built in.

The waiting in the create/start tests is a bit awkward, but here's the
reasoning for the various steps:

* Put the socket in a 'sock' subdirectory so we don't have to mess
  with inotifywait's --exclude (when what we really want is an
  --include).  In most cases, the socket would be the first thing
  created in the test directory, but the process.host test will create
  a pivot-root.* before creating the socket.

* Clear a 'wait' file before launching the inotifywait/start subshell
  and append to it from that subshell so grep has something to look at
  (even if it racily starts looking before the subshell processes the
  inotifywait line).

* Block on a busy-wait grep until inotifywait sets up its watcher.
  This ensures we don't call ccon and create the socket before the
  watcher is looking.

  The zero-second sleep while we wait for inotifywait to set up its
  watcher is busy, but POSIX doesn't require 'sleep' to support
  non-integer times [4].

All of these issues could be abstracted out into an 'event' command
[5], but they're fine for a proof-of-concept create/start split.

[1]: https://groups.google.com/a/opencontainers.org/d/msg/dev/qWHoKs8Fsrk/k55FQrBzBgAJ
     Subject: Re: Splitting Start into create() and run()
     Date: Thu, 24 Mar 2016 15:04:14 -0700
     Message-ID: <[email protected]>
[2]: opencontainers/runtime-spec#384
     Subject: Split create and start
[3]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap01.html#tag_17_06
[4]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sleep.html
[5]: opencontainers/runtime-spec#508
     Subject: runtime: Add an 'event' operation for subscribing to
       pushes

wking commented Jul 23, 2016

On Sat, Jul 16, 2016 at 04:03:17PM -0700, W. Trevor King wrote:

To make that daemonless inotify reference a bit more concrete, I've
mocked it up in shell [2].

[2]: https://github.com/wking/inotify-pubsub

And to make the OCI integration more concrete too, I have a ccon
branch where ccon-oci implements an event command along these lines
1. You can do (with process.terminal true):

$ echo 'echo hi; exit' | ccon-oci create --id container-1 >output &
$ ccon-oci event -He created container-1 && ccon-oci start container-1
{"type": "created", "timestamp": "2016-07-23T04:16:24.503950Z", "id": "container-1"}
$ cat /tmp/output
hi

If you use a second terminal for the event/start call, you can skip
the standard stream redirection:

term1 $ ccon-oci create --id container-1
…blocks after creation…

term2 $ ccon-oci event -He created container-1 && ccon-oci start container-1
{"timestamp": "2016-07-23T04:21:20.675097Z", "id": "container-1", "type": "created"}

…create unblocks on term1…
/root # echo hi
hi
/root # exit 17
term1 $ echo $?
17

That seems pretty clean to me, and doesn't require much implementation
magic to distribute events or collect the container-process's exit
code.

wking pushed a commit to wking/opencontainer-runtime-spec that referenced this pull request Jul 25, 2016