A single binary to handle basic container creation. The goal is to produce a lightweight tool in C that can serve as a test-bed for Open Container Specification development. Ccon is a thin wrapper around the underlying syscalls and kernel primitives. It makes it easy to apply a given configuration, but does not have an opinion about what a container should look like (it's even less opinionated than LXC).
When you invoke it from the command line, ccon clones a
child process to create any new namespaces declared in the config
file. The parent process continues running in the host namespace.
When the child process exits, the host process collects its exit
status and returns it to the caller. During an initial setup phase,
the two processes pass messages on a Unix socket to
synchronize the container setup. Here's an outline of the lifecycle:
| Host process | Container process |
|---|---|
| opens host executable | |
| opens namespace files | |
| clones child → | (clone unshares namespaces) |
| sets user-ns mappings | blocks on user-ns mappings |
| sends mappings-complete → | |
| blocks on full namespace | joins namespaces |
| | mounts filesystems |
| | ← sends namespaces-complete |
| runs pre-start hooks | blocks on exec-message |
| sends exec-message → | |
| | opens the local ptmx |
| | ← sends pseudoterminal master |
| waits on child death | executes user process |
| splicing standard streams | … |
| onto the pseudoterminal master | |
| | dies |
| collects child exit code | |
| runs post-stop hooks | |
| exits with child's code | |
A number of those steps are optional. For details, see the relevant
section in the configuration specification. In
general, leaving out a particular value
(e.g. namespaces.user.setgroups or
namespaces.mount.mounts) will result in that potential action
(e.g. writing to /proc/{pid}/setgroups or
calling mount) being skipped, while the rest of ccon
carries on as usual.
Users who need to join namespaces before unsharing namespaces can
use nsenter or a wrapping ccon invocation to join those
namespaces before the main ccon invocation creates the new mount
namespace.
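The message passing in the lifecycle table can be modeled in a few lines. This is a minimal sketch in Python rather than ccon's C implementation: the function name `run_handshake`, the literal message bytes, and the exit code 7 are illustrative assumptions, and a `SOCK_SEQPACKET` socket pair is assumed so that each message arrives whole.

```python
import os
import socket


def run_handshake():
    """Fork a child and walk through a simplified setup handshake.

    Returns the child's reply and its exit code, as seen by the host.
    """
    host_sock, child_sock = socket.socketpair(
        socket.AF_UNIX, socket.SOCK_SEQPACKET)
    pid = os.fork()
    if pid == 0:
        # Container side: block until the host reports each step.
        status = 1
        try:
            host_sock.close()
            if child_sock.recv(64) == b"mappings-complete":
                child_sock.send(b"namespaces-complete")
                if child_sock.recv(64) == b"exec-message":
                    status = 7  # stands in for the user process's exit code
        finally:
            os._exit(status)
    # Host side: drive the handshake, then collect the child's exit status.
    child_sock.close()
    host_sock.send(b"mappings-complete")
    reply = host_sock.recv(64)
    host_sock.send(b"exec-message")
    _, wait_status = os.waitpid(pid, 0)
    host_sock.close()
    return reply, os.WEXITSTATUS(wait_status)
```

The host only learns the child's exit code through `waitpid`, which mirrors the "collects child exit code" and "exits with child's code" rows.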
Ccon is similar to an Open Container Specification runtime in
that it reads a configuration file named config.json from its
current working directory. However, the JSON content is a bit
different, to highlight how the components relate to each other on
Linux. For example, setting per-container mounts requires a mount
namespace, so ccon's mount listing falls under
namespaces.mount.mounts. There's an example in
config.json that unprivileged users should be able to
use to launch an interactive BusyBox shell in new namespaces (you
may need to adjust the hostID entries to match id -u and id -g).
If you want to use ccon to launch OCI bundles, you can use the ccon-oci wrapper (example), which supports the Open Container Specification and the runtime command-line API.
You can load the configuration from a different file by giving its
path with the --config option. For example:
$ ccon --config path/to/config.json
or:
$ ccon --config /dev/fd/4 4<path/to/config.json
or (using Bash's process substitution):
$ ccon --config <(echo '{"version": "0.4.0", "process": …}')
You can also specify the config JSON directly on the command line with
--config-string, which may be convenient in situations where using
pipes or process substitution is too awkward:
$ ccon --config-string '{"version": "0.4.0", "process": …}'
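When the configuration is generated by a program, it can be serialized and handed to `--config-string` without a temporary file. The following is a minimal sketch; the helper name `config_string_argv` is hypothetical, and the config only uses the `version` and `process.args` keys shown in this document.

```python
import json


def config_string_argv(config, ccon="ccon"):
    """Return an argv list that passes `config` via --config-string."""
    return [ccon, "--config-string", json.dumps(config)]


config = {"version": "0.4.0", "process": {"args": ["busybox", "sh"]}}
argv = config_string_argv(config)
```

Passing `argv` as a list (for example to `subprocess.run`) avoids shell quoting problems with the embedded JSON.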
There are additional examples focusing on specific tasks in the
examples/ directory.
The ccon version represented in the config file.
version (required, SemVer 2.0.0 string)

"version": "0.4.0"

A set of namespaces to be created or joined by the container process.
Keys match the long-form options from unshare and
nsenter without their leading hyphens. For each
namespace entry, the presence of a path key means the container
process will join an existing namespace at the absolute path specified
by the path value. The absence of a path key means a new
namespace will be created. There may be additional per-namespace
configuration in the namespace object. If there is no
namespaces entry or its value is an empty object, the container
process will inherit all its namespaces from the host process.
Similarly, if a particular namespaces entry is missing
(e.g. user), the container process will
inherit that namespace from the host process.
namespaces (optional, object) containing entries for each new or joined namespace.

"namespaces": {
  "uts": {},
  "net": {"path": "/proc/2186/ns/net"},
  "user": {"setgroups": false}
}

Which will create new UTS and
user namespaces, join the network namespace at
/proc/2186/ns/net, and disable setgroups in the new
user namespace.
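The create, join, or inherit rule can be written as a small decision function. This is a sketch of the rule as described here, not ccon's code; `namespace_actions` and `KNOWN_NAMESPACES` are hypothetical names, and the list covers only the namespaces documented in this file.

```python
KNOWN_NAMESPACES = ("user", "mount", "pid", "net", "ipc", "uts")


def namespace_actions(config):
    """Map each namespace name to 'inherit', 'create', or 'join <path>'."""
    namespaces = config.get("namespaces") or {}
    actions = {}
    for name in KNOWN_NAMESPACES:
        if name not in namespaces:
            # Missing entry: keep the host process's namespace.
            actions[name] = "inherit"
        elif "path" in namespaces[name]:
            # A path key means joining an existing namespace.
            actions[name] = "join " + namespaces[name]["path"]
        else:
            # An entry without a path means a new namespace.
            actions[name] = "create"
    return actions
```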
New user namespaces support the
/proc/{pid}/{path} files setgroups, uid_map, and gid_map
discussed in user_namespaces(7).
user (optional, object) which may contain:
- path (optional, string) the absolute path to a user namespace which the container process should join.
- setgroups (optional, boolean) whether to enable or disable setgroups. Implemented by writing to /proc/{pid}/setgroups.
- uidMappings (optional, array of objects) maps user IDs between the new namespace and its parent namespace. Implemented by writing to /proc/{pid}/uid_map. Array entries are objects with the following fields:
  - containerID (required, integer) is the start of the mapped UID range in the new namespace.
  - hostID (required, integer) is the start of the mapped UID range in the parent namespace.
  - size (required, integer) is the length of the range of mapped UIDs.
- gidMappings (optional, array of objects) maps group IDs between the new namespace and its parent namespace. Implemented by writing to /proc/{pid}/gid_map. Array entries are objects with the following fields:
  - containerID (required, integer) is the start of the mapped GID range in the new namespace.
  - hostID (required, integer) is the start of the mapped GID range in the parent namespace.
  - size (required, integer) is the length of the range of mapped GIDs.

"user": {
  "setgroups": false,
  "uidMappings": [
    {
      "containerID": 0,
      "hostID": 1000,
      "size": 1
    }
  ],
  "gidMappings": [
    {
      "containerID": 0,
      "hostID": 1000,
      "size": 1
    }
  ]
},

Which will disable setgroups and map the host user
and group 1000 to the container user and group 0.
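Per user_namespaces(7), each line of a uid_map or gid_map file holds three numbers: the start of the range inside the namespace, the start of the range in the parent namespace, and the length. A minimal sketch of turning the mapping objects above into that text follows; `format_id_map` is a hypothetical helper, not a ccon function.

```python
def format_id_map(mappings):
    """Render uidMappings/gidMappings entries as uid_map/gid_map text."""
    return "".join(
        "{containerID} {hostID} {size}\n".format(**mapping)
        for mapping in mappings
    )
```

For the example above, the result is the single line `0 1000 1`, which is what would be written to /proc/{pid}/uid_map for the container process.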
New mount namespaces support the creation of arbitrary mounts, assuming the caller has sufficient privileges for the underlying syscall. The user namespace documentation outlines the mount permissions for processes inside a user namespace.
mount (optional, object) which may contain:
- path (optional, string) the absolute path to a mount namespace which the container process should join.
- mounts (optional, array) an ordered list of mounts to perform. Array entries are objects with fields based on the mount call:
  - type (string) of mount (see filesystems(5)).
  - source (string) path of mount. This may be optional or required depending on type.
  - target (string, required) path of the mount being created or manipulated.
  - flags (array of strings, optional) MS_* flags to set.
  - data (string, optional) type-specific data for the mount.
If they don't start with a slash, source and target are
interpreted as paths relative to ccon's current working
directory.
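The relative-path rule can be illustrated with a short helper. This is a sketch of the rule as stated, with a hypothetical function name and a made-up working directory; it does not cover how ccon treats non-path sources.

```python
import posixpath


def resolve_mount_path(path, cwd):
    """Resolve a mount source or target against ccon's working directory."""
    if path.startswith("/"):
        return path  # absolute paths are used as given
    return posixpath.join(cwd, path)
```

With a working directory of `/srv/bundle` (an assumed example), `rootfs/etc/resolv.conf` resolves to `/srv/bundle/rootfs/etc/resolv.conf`, while `/etc/resolv.conf` is left untouched.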
In addition to the usual types supported by mount, ccon
supports a pivot-root type that invokes the
pivot_root syscall, shifting the old
root to a temporary directory (after which it is unmounted and the temporary
directory is removed). In that case, the only other field that
matters is source, which specifies the new root.

"mount": {
  "mounts": [
    {
      "source": "rootfs",
      "target": "rootfs",
      "flags": [
        "MS_BIND"
      ]
    },
    {
      "source": "/etc/resolv.conf",
      "target": "rootfs/etc/resolv.conf",
      "flags": [
        "MS_BIND"
      ]
    },
    {
      "source": "root",
      "target": "rootfs/root",
      "flags": [
        "MS_BIND"
      ]
    },
    {
      "source": "rootfs",
      "type": "pivot-root"
    }
  ]
}

Which will bind ${PWD}/rootfs to itself (the “trick” mentioned in
switch_root(8) which we need for the later
pivot), bind the host's resolv.conf onto
${PWD}/rootfs/etc/resolv.conf, bind ${PWD}/root onto
${PWD}/rootfs/root, and pivot to make ${PWD}/rootfs the container
root.
There is no special configuration for the PID namespace, although if you are creating both a PID and a mount namespace, you probably want mount entries along the lines of:
{
  "target": "/proc",
  "flags": [
    "MS_PRIVATE",
    "MS_REC"
  ]
},
{
  "target": "/proc",
  "type": "proc",
  "flags": [
    "MS_NOSUID",
    "MS_NOEXEC",
    "MS_NODEV"
  ]
}

For more details, see the “/proc and PID namespaces” section of
pid_namespaces(7).
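The flags arrays in mount entries name `MS_*` constants that are OR-ed together for the mount call. A minimal sketch of that combination follows; the table holds only the flags used in this document plus `MS_RDONLY`, with values taken from the Linux `<sys/mount.h>` header, and `mount_flags` is a hypothetical helper rather than ccon's parser.

```python
# Subset of MS_* values from Linux <sys/mount.h>.
MS_FLAGS = {
    "MS_RDONLY": 1,
    "MS_NOSUID": 2,
    "MS_NODEV": 4,
    "MS_NOEXEC": 8,
    "MS_BIND": 4096,
    "MS_REC": 16384,
    "MS_PRIVATE": 1 << 18,
}


def mount_flags(names):
    """OR the named MS_* flags into the integer passed to mount."""
    flags = 0
    for name in names:
        flags |= MS_FLAGS[name]  # raises KeyError for unknown names
    return flags
```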
pid (optional, object) which may contain:
- path (optional, string) the absolute path to a PID namespace which the container process should join.

There is no special configuration for the network namespace.
net (optional, object) which may contain:
- path (optional, string) the absolute path to a network namespace which the container process should join.

There is no special configuration for the IPC namespace.
ipc (optional, object) which may contain:
- path (optional, string) the absolute path to an IPC namespace which the container process should join.

There is no special configuration for the UTS
namespace, although future work might build in support
for sethostname.
uts (optional, object) which may contain:
- path (optional, string) the absolute path to a UTS namespace which the container process should join.
After the container setup is finished, the container process can
optionally adjust its state and execute the configured code. If
process isn't specified, the container process will exit (with
an exit code of zero) instead of executing a user process (which can
be useful for the creation phase of a workflow that separates creation
from execution).
process (optional, object) configuring the container process after the container is set up.

"process": {
  "args": ["busybox", "sh"]
}

Which will execvpe a BusyBox shell with the host
process's user and group (possibly mapped by the user
namespace), working directory, and environment.
If you launch ccon from a terminal (e.g. tty or test -t 0 returns zero), your standard input is already a
terminal and you probably don't need to worry about this setting. If
you launch ccon from a non-terminal process (e.g. from a webserver
that is communicating with the user over a socket), you may want to
create a UNIX 98 pseudoterminal to do things like translate
the user's control-C into SIGINT for the container.
Containers that do not pivot root or that otherwise
keep access to the host ptmx can create such a pseudoterminal
by opening the ptmx (e.g. with
posix_openpt).
Containers that are pivoting to a new root and mounting their devpts with newinstance will want to ensure that the pseudoterminal is created using a devpts instance that will be accessible after the pivot, and there are a number of issues to consider.
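The master and slave relationship of a pseudoterminal can be seen in a short experiment. This sketch assumes Python's `os.openpty` (which allocates a pair in one call, rather than calling `posix_openpt` directly) and a hypothetical helper name; it only shows that bytes written to the master arrive as terminal input on the slave, not how ccon passes the master between processes.

```python
import os


def pty_roundtrip(line=b"ping\n"):
    """Write a line to a pty master and read it back from the slave."""
    master_fd, slave_fd = os.openpty()
    try:
        # What the host side writes to the master...
        os.write(master_fd, line)
        # ...shows up as input on the slave, where the container's
        # standard streams would be attached.
        return os.read(slave_fd, 1024)
    finally:
        os.close(master_fd)
        os.close(slave_fd)
```

Because the data passes through the terminal line discipline, a control-C byte written to the master can be turned into SIGINT for processes whose controlling terminal is the slave.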
terminal (optional, boolean) if true, the process will open its local /dev/ptmx (e.g. with posix_openpt), dup the pseudoterminal slave over its standard streams, and send the pseudoterminal master back to the host process. The host process will continually copy its standard input to that pseudoterminal master and the pseudoterminal master to its standard output.

"args": ["sh"],
"terminal": true

Adjust the user and group IDs before executing the user-specified code.
- uid (optional, integer) to setuid to a different user.
- gid (optional, integer) to setgid to a different group.
- additionalGids (optional, array of integers) for setgroups. See also namespaces.user.setgroups.

"user": {
  "uid": 0,
  "gid": 0,
  "additionalGids": [5, 6]
}

Which will lead to a container process with id output like:
uid=0(root) gid=0(root) groups=0(root),5(tty),6(disk)
Change to a different directory before executing the configured code.
cwd (optional, string) to chdir to a different directory. If unset, the current directory will remain the same as the caller's working directory, unless there is a pivot-root entry in namespaces.mount.mounts, in which case the default working directory will be the new root.

"cwd": "/root"

Define the minimum set of capabilities required for the container process. All other capabilities are dropped from all capability sets, including the bounding set, before executing the configured code.
capabilities (optional, array of strings) Set of CAP_* flags to set.
If unset, the container process will continue with the caller's capabilities (potentially increased in a child user namespace).

"capabilities": [
  "CAP_NET_BIND_SERVICE",
  "CAP_NET_RAW"
]

The command that the container process executes after container setup
is complete. The process will inherit any open file descriptors; for
example the standard streams (unless
terminal is true) or systemd's
SD_LISTEN_FDS_START.
args (optional, array of strings) holds command-line arguments passed to execvpe. The first argument (args[0]) is also used as the path, unless path is set.
If unset, the container process will exit with status zero instead of executing new code (see Process).

"args": [
  "nginx",
  "-c",
  "/nginx.conf"
]

Which will execute an Nginx server using the configuration in
/nginx.conf.
Override args[0] with an alternate path (but the executed code
will still see args[0] as its first argument).
path (optional, string) sets the path to the executed command. Paths without slashes will be resolved using the PATH environment variable.

"args": ["sh"],
"path": "busybox"

Which will execute the first busybox executable found in
your PATH with its argv[0] set to sh.
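The relationship between args and path can be summarized as: path chooses the file to execute, args is passed through unchanged as argv. A minimal sketch under that reading follows; `exec_plan` is a hypothetical helper that only computes the pair and does not execute anything.

```python
def exec_plan(process):
    """Return (file, argv) for an execvpe-style call from a process object."""
    args = process["args"]
    # path overrides args[0] as the file to look up, but argv is untouched.
    return process.get("path", args[0]), list(args)
```

The returned pair has the shape expected by Python's `os.execvpe(file, args, env)`, which performs the same PATH lookup for names without slashes.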
Instead of looking up args[0] (or
path) in the container mount namespace, look it up in
the host mount namespace using the host PATH. This allows you to
launch (via execveat, so you need Linux
3.19+) a statically-linked init process that
only exists on the host.

"args": ["sh"],
"path": "busybox",
"host": true

Which will execute the first busybox executable found in
the host PATH (resolved in the host mount namespace) with its argv[0] set to sh.
Override the host environment.
env (optional, array of strings) holds environment settings for execvpe.
If unset, the container process will use the environ
it inherited from the host.

"env": [
  "PATH=/bin:/usr/bin",
  "TERM=xterm"
]

Which will set PATH and TERM.
Not all container-related functionality is built into ccon (the only
setup handled by the host process is the /proc/{pid}/setgroups,
etc., writes for user namespaces). For example,
control group manipulation and veth network
configuration should be handled with external tools.
What ccon provides are hooks so you can call those external tools at
the appropriate point in the lifecycle.
hooks (optional, object) configuring the hooks run for each hook-triggering event.

"hooks": {
  "pre-start": [
    {
      "args": [
        "echo",
        "I'm a pre-start hook"
      ]
    }
  ],
  "post-stop": [
    {
      "args": [
        "echo",
        "I'm a post-stop hook"
      ]
    }
  ]
}

Which will just print messages to the host process's stdout for each hook-triggering event.
Hooks run after the container setup is complete but before the
configured process is executed. This is useful for
additional container configuration (e.g. creating cgroups or
performing network setup).
pre-start (optional, array of objects) holds process objects (like process except for stdin handling and the lack of host) to run after the pre-start event.
Each hook receives the container process's PID in the host PID
namespace on its stdin. Its stdout and
stderr are inherited from the host process (unless
terminal is true). The hooks are executed in the
listed order, the host process waits until each hook exits before
executing the next, and a nonzero exit code from any hook will cause
the host process to abandon further hook execution and
SIGKILL the container process. The host process then resumes
the usual lifecycle at “waits on child death”.

"pre-start": [
  {
    "args": [
      "mkdir",
      "-p",
      "/sys/fs/cgroup/unified/nginx-0/container"
    ]
  },
  {
    "args": [
      "tee",
      "/sys/fs/cgroup/unified/nginx-0/container/cgroup.procs"
    ]
  }
]

Which will create new nginx-0 and nginx-0/container cgroups in the
unified hierarchy (if they don't already exist) and
add the container process to that cgroup.
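The pre-start sequencing rules (run in order, feed the container PID on stdin, stop at the first failure) can be modeled with `subprocess`. This is a sketch, not ccon's implementation: `run_pre_start_hooks` is a hypothetical name, the PID is assumed to be written as decimal text followed by a newline, and per-hook fields other than args are ignored.

```python
import subprocess


def run_pre_start_hooks(hooks, container_pid):
    """Run hooks in order; return (all_succeeded, exit_codes_so_far)."""
    exit_codes = []
    stdin_data = "{}\n".format(container_pid).encode()
    for hook in hooks:
        # Each hook gets the container PID on stdin and must exit
        # before the next hook starts.
        result = subprocess.run(hook["args"], input=stdin_data)
        exit_codes.append(result.returncode)
        if result.returncode != 0:
            # Abandon the remaining hooks; the caller would now
            # SIGKILL the container process.
            return False, exit_codes
    return True, exit_codes
```

This is why the `tee` hook in the example works without arguments naming the PID: it simply copies its stdin into `cgroup.procs`.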
Hooks run after the host process has reaped the container process. You could handle this in the shell with:
$ ccon; post_stop_hook_1; post_stop_hook_2
but the most common use will be cleaning up after pre-start hooks, and it's nice to configure both in the same place (the ccon config file).
post-stop (optional, array of objects) holds process objects (like process except for the lack of host) to run after the post-stop event.
Each hook's standard streams are inherited from the host process
(unless terminal is true). The hooks are executed
in the listed order, the host process waits until each hook exits
before executing the next, and a nonzero exit code from any hook will
cause the host process to print a message to stderr, after which it
continues as if the hook had exited with zero.

"post-stop": [
  {
    "args": [
      "rmdir",
      "/sys/fs/cgroup/unified/nginx-0/container"
    ]
  },
  {
    "args": [
      "rmdir",
      "/sys/fs/cgroup/unified/nginx-0"
    ]
  }
]

Which will remove nginx-0/container and nginx-0 cgroups (such as
those created by the pre-start example). This will
only succeed if the cgroups are empty, so if you were using this in
production it would be best to:
- Ensure there were no other processes in those cgroups (e.g. by creating a new PID namespace and adding all additional processes to that namespace before adding them to the nginx-0 cgroup tree).
- Use a tool like cgdelete to recursively remove nginx-0, which would also remove additional child cgroups beyond nginx-0/container that may have been added by other processes since nginx-0 was created.
- Linux headers for 3.19+ for execveat (sys-kernel/linux-headers on Gentoo).
- The GNU C Library (sys-libs/glibc on Gentoo).
- Jansson for JSON parsing (dev-libs/jansson on Gentoo).
- libcap-ng for adjusting capabilities (sys-libs/libcap-ng on Gentoo).
Ccon is pretty easy to compile, but to use the stock Makefile, you'll need:
- A C compiler like GCC (sys-devel/gcc on Gentoo).
- GNU Make (sys-devel/make on Gentoo).
- pkg-config (dev-util/pkgconfig on Gentoo).
- indent (dev-util/indent on Gentoo). Invoke with
make fmt.
- Ccon is under the GPLv3+.
- Glibc is under the LGPL-2.1+.
- Jansson is under the MIT license.
- libcap-ng is under the LGPL-2.1+.
Because all the dependencies are GPL-compatible, ccon binaries can be distributed under the GPLv3+.