runtime: Add an 'event' operation for subscribing to pushes#508
wking wants to merge 5 commits into opencontainers:master
Conversation
proc(5) describes the following state entries in proc/[pid]/stat [1]
(for modern kernels):
* R Running
* S Sleeping in an interruptible wait
* D Waiting in uninterruptible disk sleep
* Z Zombie
* T Stopped (on a signal)
* t Tracing stop
* X Dead
and ps(1) has a bit more context [2] (for modern kernels):
* D uninterruptible sleep (usually IO)
* R running or runnable (on run queue)
* S interruptible sleep (waiting for an event to complete)
* T stopped by job control signal
* t stopped by debugger during the tracing
* X dead (should never be seen)
* Z defunct ("zombie") process, terminated but not reaped by its
parent
So I expect "stopped" to mean "process still exists but is paused,
e.g. by SIGSTOP". And I expect "exited" to mean "process has finished
and is either a zombie or dead".
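As a quick illustration (mine, not part of the commit), you can read
that state letter from a shell; 12345 is a placeholder PID:

  # State is the field after the parenthesized comm, which may itself
  # contain spaces, so strip everything through the closing paren.
  $ sed 's/^.*) //' /proc/12345/stat | cut -d' ' -f1
  # Or ask ps(1); note ps may append modifier characters (e.g. 'Ss').
  $ ps -o stat= -p 12345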
After this commit, 'git grep -i stop' only turns up poststop-hook
stuff, a reference in principles.md, a "stoppage" in LICENSE, and some
ChangeLog entries.
Also replace "container's process" with "container process" to match
usage in the rest of the repository. After this commit:
$ git grep -i "container process" | wc -l
16
$ git grep -i "container's process" | wc -l
1
Also reword status entries to avoid "running", which is less precise
in our spec (e.g. it also includes "sleeping", "waiting", ...).
Also remove a "them" leftover from a partial plural -> singular
reroll of be59415 (Split create and start, 2016-04-01, opencontainers#384).
[1]: http://man7.org/linux/man-pages/man5/proc.5.html
[2]: http://man7.org/linux/man-pages/man1/ps.1.html
Signed-off-by: W. Trevor King <[email protected]>
To distinguish between "we're still setting this container up" and "we're finished setting up; you can call 'start' if you like". Also reference the lifecycle steps, because you can't be too explicit.

Signed-off-by: W. Trevor King <[email protected]>
Because during the 'creating' phase we may not have a container process yet (e.g. if we're still reading the configuration or setting up cgroups), and in the 'stopped' phase the PID is no longer meaningful.

Signed-off-by: W. Trevor King <[email protected]>
The current 'state' operation allows callers to poll for the state,
but for some workflows polling is inefficient (how frequently do you
poll to balance the cost of polling against the timeliness of the
response?) and push notifications make more sense.
The runtime's 'create' process is in a unique position to detect these
status transitions.
* As the actor carrying out container creation, it should have a clear
idea of when that creation completes (for the 'created' event).
* It may set up a communication channel with the container process to
orchestrate creation, and that channel may be used to report the
'started' event.
* It knows (or is) the parent of the container process, and POSIX's
wait(3), waitpid(3), and waitid(3) only work for child processes
[1,2]. From [1]:
Nothing in this volume of POSIX.1-2008 prevents an implementation
from providing extensions that permit a process to get status from
a grandchild or any other process, but a process that does not use
such extensions must be guaranteed to see status from only its
direct children.
So the runtime can set up (or be) the parent waiting on the container
process, and arrange for the 'stopped' event to be published on
container exit.
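A minimal shell sketch of that parent/child relationship (the shell's
'wait' builtin has the same children-only restriction as waitpid(3);
'container-process' and the event log are placeholders):

  container-process &        # stand-in for the real container process
  pid=$!
  wait "$pid"                # only works because $pid is our child
  status=$?
  # Publish the 'stopped' event, e.g. by appending to a per-container log.
  printf 'stopped %d\n' "$status" >>events.log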
I've tried to phrase the requirements conservatively to allow for
runtimes that have to poll their kernel or some such to notice these
changes. I see the following runtime-support cases:
a. The runtime can easily supply a push-based event operation. In
this case, exposing that operation to callers facilitates push-based
workflows without much cost.
b. The runtime cannot supply a push-based event operation, and has to
emulate it by polling. In this case, the runtime can pick a
polling strategy that makes sense to its maintainers, and callers
who aren't satisfied with that strategy can roll their own state
poller without a big efficiency hit (in a lesser-of-two-evils way;
see the sketch after this list).
The requirement is currently worded so weakly that a runtime would
be compliant with:
1. Container process dies at noon.
2. User calls 'state' at 5pm.
3. Runtime checks kernel, and sees that the container process is
dead.
4. Runtime publishes 'stopped' event with a 5pm (and some
microseconds) timestamp.
5. Runtime returns state to the user with 'stopped' in 'status'.
which is a pretty low bar.
c. The runtime could supply a push-based event operation, but it would
be a lot of work. These runtimes can use polling (b) as a
quick-and-dirty solution until someone has time to implement a
push-based solution (a). The improvement is transparent to users,
who can use the same event operation throughout and passively reap
the benefits of the implementation improvements.
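For case (b), the polling emulation could be as simple as the
following sketch, using a hypothetical 'funC' runtime (the same
placeholder name used later in this thread); the interval and the
state comparison are placeholders a real runtime would choose for
itself:

  previous=
  while :; do
    current=$(funC state container-1)    # poll the required operation
    if [ "$current" != "$previous" ]; then
      printf 'event: %s\n' "$current"    # publish the transition
      previous=$current
    fi
    sleep 1                              # maintainer-chosen polling period
  done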
Without an operation like this, higher levels that need to trigger on
these transitions without polling need to exercise a lot of control
over the system:
* They must be the parent of (or on Linux, the nearest
PR_SET_CHILD_SUBREAPER ancestor of [3]) the create process, and the
create process needs to exit after creation completes, if they want
to block on the 'created' event (see the sketch after this list).
* They must proxy all 'start' and 'delete' requests if they want to
block on the 'started' or 'deleted' events.
* On Linux, they must be the nearest ancestor of the create operation
to set PR_SET_CHILD_SUBREAPER if they want to block on the 'stopped'
event.
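The first workaround above can be sketched in shell (again with the
hypothetical 'funC' and placeholder hooks): the caller blocks on the
create process's exit as a stand-in for the 'created' event:

  funC create container-1 &&    # exits once creation completes
    post-create-hook &&         # placeholder for caller-side work
    funC start container-1

Nothing comparable exists for the 'stopped' event without
waitid()/PR_SET_CHILD_SUBREAPER access, which is the gap the 'event'
operation fills.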
By making 'event' a runtime requirement, we allow for efficient
cross-platform push-based workflows while avoiding the need for tight
orchestrator gate-keeping. Runtimes that have implementation
difficulties have an easy out that allows their callers to benefit
from future implementation improvements. And callers that are not
satisfied can always fall back to polling state or the
proxy/waitid/PR_SET_CHILD_SUBREAPER approaches.
[1]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/wait.html
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/waitid.html
[3]: http://man7.org/linux/man-pages/man2/prctl.2.html
Signed-off-by: W. Trevor King <[email protected]>
This is all very generic, and I expect more details to land in the runtime API specification. Requirements around the buffer-request semantics (e.g. "buffer for 15 seconds" or "buffer the last 5 events" or "buffer the 'created' event") seemed out of place at the level of detail in this specification.

The goal is to allow for:

$ funC create --event-buffer created ID &
$ funC event --event created ID && hook1 && hook2 && funC start ID
$ fg

To support blocking on the event without racing on "maybe the 'created' event happened before the 'event' operation attached".

Given the small number of required events, this buffering should not be a large resource concern, and '--event-buffer created' would only ever require a single event to be buffered per container.

There is still a race on "maybe the container has already been destroyed and a second container has been created with the same ID before the 'event' operation attached", but that seems much less likely (especially since the caller is free to pick UUIDs).

Signed-off-by: W. Trevor King <[email protected]>
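One possible (purely speculative) buffer implementation is a file per
published event under the container's state directory, so a late
'event' subscriber can replay; the layout below is a guess:

  state=/run/funC/container-1/events    # hypothetical layout
  mkdir -p "$state"
  date -u +%FT%TZ >"$state/created"     # 'create' records the event
  # A later 'event --event created' call can return immediately:
  [ -e "$state/created" ] && echo created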
|
I don't like this. For runC, it would mean that we'd have to make a long-running daemon to implement this functionality. IMO that defeats the whole point of runC -- it's a single binary you can use without needing any setup. If we make it so that runC doesn't exit after starting a container, then I'm still iffy about it. |
|
On Mon, Jul 11, 2016 at 07:12:09PM -0700, Aleksa Sarai wrote:
Why is that? In ccon, where I'm happy with "the start-socket exists" event-handling-process where each parent outlives its child.

But runtimes that value process event-handling host process

If you don't want extra server-side stuff for runC, you could put

And if you want to invest the minimum dev time, you can always fall

I agree that a system without this 'event' operation is lighter |
|
I'm good with the idea of an event model.. but agree we should keep tools like runC lower in the stack. For example, runC is not an orchestrator; it's a low-level component. Thus I think the idea has been to keep this runtime spec at said lower level. Just thinking out loud.. can't we map the event notifications to Mrunal's hooks and store event subscriptions as annotations? IOW maybe the spec discussion should avoid describing how the implementation of the event model should be provided. |
|
On Tue, Jul 12, 2016 at 07:05:15AM -0700, Mike Brown wrote:
Can you unpack "event registrations"? Do you just mean "the event

"annotations": {

I like the state subdir approach better 1, since you don't have to

I'm not sure where you're going with "should be provided". I think

But if this PR lands, I think the API spec (#511 / #513) should |
|
/s/event registrations/event subscriptions/ |
|
+1 to what @cyphar said on why we shouldn't do this |
Catch up with be59415 (Split create and start, 2016-04-01, opencontainers#384).

I'm not a big fan of this early-exit 'create', which requires platform-specific magic to collect the container process's exit code. I have a proposal for an 'event' operation [1] which would provide a convenient created trigger. With 'event' in place, we don't need the 'create' process exit to serve as that trigger, and could have a long-running 'create' that collects the container process exit code using the portable waitid() family. But the consensus after this week's meeting was to table that while we land docs for the runC API [2].

The "SHOULD exit as quickly as possible after ..." wording is going to be hard to enforce (which is why it's not MUST), but with the runC model, clients rely on the command exits to trigger post-create and post-start activity. The longer the runtime hangs around after completing its action, the laggier those triggers will be.

The 'create' ID argument is required (while the old 'start' ID argument was optional) because the runC approach requires an explicit 'delete' call to clean up the container. With a long-running 'create' that could automatically clean up, you could have an optional ID to support users that don't care what ID is used (because they know they won't be calling 'state', 'start', etc.).

One benefit of the early-exit 'create' is that the exit code does not conflate container process exits with "failed to set up the sandbox" exits. We can take advantage of that and use non-zero 'create' exits to allow stderr writing (so the runtime can log errors while dying without having to successfully connect to syslog or some such).

The "MUST NOT attempt to read from its stdin" means a generic caller can safely exec the command with a closed or null stdin, and not have to worry about the command blocking or crashing because of that. The stdout spec for start/delete is more lenient, because runtimes are unlikely to change their behavior because they are unable to write to stdout. If this assumption proves troublesome, we may have to tighten it up later.

The ptrace idea is from Mrunal [3].

[1]: opencontainers#508
[2]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-07-13-17.03.log.html#l-15
[3]: http://ircbot.wl.linuxfoundation.org/eavesdrop/%23opencontainers/%23opencontainers.2016-07-13.log.html#t2016-07-13T18:58:54

Signed-off-by: W. Trevor King <[email protected]>
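To make the stdin requirement concrete, here is a sketch of how a
generic caller benefits ('funC' standing in for any compliant
runtime): it can null stdin and capture stderr without risking a hang:

  funC create container-1 </dev/null 2>create-errors.log
  if [ $? -ne 0 ]; then
    cat create-errors.log    # non-zero exit permits stderr diagnostics
  fi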
|
On Tue, Jul 12, 2016 at 01:02:21PM -0700, Michael Crosby wrote:
Responding to @cyphar, I pushed back on this requiring a long-running Does that address your concerns about long-running processes? If not, |
Along the lines of the create/start split we've been discussing for the OCI [1,2]. Pull some common functionality into a libccon, although I don't intend to make that more than an internal helper.

Post-error cleanup in ccon-cli is pretty slim, since the process is just about to die anyway. I'll probably go back through and add proper cleanup later.

serve_socket gets too deeply nested for my taste, but it's not quite to the point where I'd pull out the request handling into a handle_start_request helper or some such. Mostly because I'd have to either set up a new buf/iov/msg in the helper's stack or pass that all through from serve_socket ;).

A few notes on the tests: I don't have a stand-alone 'wait' on my system (it's built into most shells [3]), but I've added an explicit check for it because POSIX doesn't require it to be built in. The waiting in the create/start tests is a bit awkward, but here's the reasoning for the various steps:

* Put the socket in a 'sock' subdirectory so we don't have to mess with inotifywait's --exclude (when what we really want is an --include). In most cases, the socket would be the first thing created in the test directory, but the process.host test will create a pivot-root.* before creating the socket.

* Clear a 'wait' file before launching the inotifywait/start subshell and append to it from that subshell so grep has something to look at (even if it racily starts looking before the subshell processes the inotifywait line).

* Block on a busy-wait grep until inotifywait sets up its watcher. This ensures we don't call ccon and create the socket before the watcher is looking. The zero-second sleep while we wait for inotifywait to set up its watcher is busy, but POSIX doesn't require 'sleep' to support non-integer times [4].

All of these issues could be abstracted out into an 'event' command [5], but they're fine for a proof-of-concept create/start split.

[1]: https://groups.google.com/a/opencontainers.org/d/msg/dev/qWHoKs8Fsrk/k55FQrBzBgAJ
Subject: Re: Splitting Start into create() and run()
Date: Thu, 24 Mar 2016 15:04:14 -0700
Message-ID: <[email protected]>
[2]: opencontainers/runtime-spec#384
Subject: Split create and start
[3]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap01.html#tag_17_06
[4]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sleep.html
[5]: opencontainers/runtime-spec#508
Subject: runtime: Add an 'event' operation for subscribing to pushes
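Pieced together from the notes above, the test synchronization
presumably looks something like this sketch ($dir, the grep pattern,
and the final ccon invocation are my guesses, not the actual tests):

  mkdir -p "$dir/sock"
  : >"$dir/wait"                        # clear the 'wait' file
  (inotifywait -e create "$dir/sock" >>"$dir/wait" 2>&1) &
  # Busy-wait (zero-second sleeps) until the watcher is established,
  # since POSIX doesn't require 'sleep' to accept non-integer times.
  until grep -q 'Watches established' "$dir/wait"; do
    sleep 0
  done
  ccon ...                              # now safe to create the socket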
|
On Sat, Jul 16, 2016 at 04:03:17PM -0700, W. Trevor King wrote:
And to make the OCI integration more concrete too, I have a ccon

$ echo 'echo hi; exit' | ccon-oci create --id container-1 >output &

If you use a second terminal for the event/start call, you can skip

term1 $ ccon-oci create --id container-1
term2 $ ccon-oci event -He created container-1 && ccon-oci start container-1
…create unblocks on term1…

That seems pretty clean to me, and doesn't require much implementation |
This builds on #507, so that should be reviewed first.
UPDATE: note that this pub/sub model does not require an additional long-running process. For an example implementation based on shared access to a directory (like runC uses for state) and inotify, see wking/inotify-pubsub.
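For flavor, a minimal sketch of blocking on 'created' with that
directory-plus-inotify approach (the layout here is illustrative, not
necessarily inotify-pubsub's actual format):

  dir=/run/funC/container-1/events      # hypothetical shared state dir
  until [ -e "$dir/created" ]; do
    # Block on inotify rather than spinning; the short timeout papers
    # over the race between the test above and watch establishment.
    inotifywait -qq -t 1 -e create "$dir"
  done
  funC start container-1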