-
Notifications
You must be signed in to change notification settings - Fork 692
Allow CRIU to be used as non-root with CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN #1155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
test/zdtm.py
Outdated
| help="Use splice or read mode of pre-dumping", | ||
| choices=['splice', 'read'], | ||
| default='splice') | ||
| rp.add_argument("--non-root", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we already have the --user option. How is this one different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all. I was not aware of --user. I have updated this PR to use --user instead of --non-root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename --user to --non-root? user could mean anything. (criU, user namespaces). Not at all obvious of what it would do. --unprivileged is pretty excplicit too. I'd argue that Adrian might have seen the option if the naming of the option was unambiguous. Our users might miss it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the history behind --user? I see now that is has been used in one of the jenkins jobs which I am not familiar with.
Is --user already testing exactly the same I wanted to test with this PR? Running CRIU with CAP_CHECKPOINT_RESTORE/CAP_SYS_ADMIN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the history behind --user?
It was Pavel's attempt to implement criu dump that can't be executed by non-root users to dump their processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I suggest --rootless? It is an established term and the meaning is very clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first patch series was using --non-root, now I am currently using --user. --rootless is indeed a good idea, but I do not really care which name it is.
Should we re-use --user or give it a new name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea of the --user option is to allow non-root users dumping processes even when criu doesn't have any capabilities. We want to save this functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed my changes to --rootless.
| val = getenv("ZDTM_GROUPS"); | ||
| if (val) { | ||
| char *tok = NULL; | ||
| unsigned int size = 0, groups[NGROUPS_MAX]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: NGROUPS_MAX is 65536
Isn't it will be better to allocate memory from heap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just me moving code behind an if if running as non-root. I have not changed the actual code. If this should be changed it should happen in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just me moving code behind an
ifif running as non-root. I have not changed the actual code. If this should be changed it should happen in another PR.
no problem. It's just comment.
…cm/linux/kernel/git/brauner/linux
Pull checkpoint-restore updates from Christian Brauner:
"This enables unprivileged checkpoint/restore of processes.
Given that this work has been going on for quite some time the first
sentence in this summary is hopefully more exciting than the actual
final code changes required. Unprivileged checkpoint/restore has seen
a frequent increase in interest over the last two years and has thus
been one of the main topics for the combined containers &
checkpoint/restore microconference since at least 2018 (cf. [1]).
Here are just the three most frequent use-cases that were brought forward:
- The JVM developers are integrating checkpoint/restore into a Java
VM to significantly decrease the startup time.
- In high-performance computing environment a resource manager will
typically be distributing jobs where users are always running as
non-root. Long-running and "large" processes with significant
startup times are supposed to be checkpointed and restored with
CRIU.
- Container migration as a non-root user.
In all of these scenarios it is either desirable or required to run
without CAP_SYS_ADMIN. The userspace implementation of
checkpoint/restore CRIU already has the pull request for supporting
unprivileged checkpoint/restore up (cf. [2]).
To enable unprivileged checkpoint/restore a new dedicated capability
CAP_CHECKPOINT_RESTORE is introduced. This solution has last been
discussed in 2019 in a talk by Google at Linux Plumbers (cf. [1]
"Update on Task Migration at Google Using CRIU") with Adrian and
Nicolas providing the implementation now over the last months. In
essence, this allows the CRIU binary to be installed with the
CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
unprivileged users to restore processes.
To make this possible the following permissions are altered:
- Selecting a specific PID via clone3() set_tid relaxed from userns
CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.
- Selecting a specific PID via /proc/sys/kernel/ns_last_pid relaxed
from userns CAP_SYS_ADMIN to CAP_CHECKPOINT_RESTORE.
- Accessing /proc/pid/map_files relaxed from init userns
CAP_SYS_ADMIN to init userns CAP_CHECKPOINT_RESTORE.
- Changing /proc/self/exe from userns CAP_SYS_ADMIN to userns
CAP_CHECKPOINT_RESTORE.
Of these four changes the /proc/self/exe change deserves a few words
because the reasoning behind even restricting /proc/self/exe changes
in the first place is just full of historical quirks and tracking this
down was a questionable version of fun that I'd like to spare others.
In short, it is trivial to change /proc/self/exe as an unprivileged
user, i.e. without userns CAP_SYS_ADMIN right now. Either via ptrace()
or by simply intercepting the elf loader in userspace during exec.
Nicolas was nice enough to even provide a POC for the latter (cf. [3])
to illustrate this fact.
The original patchset which introduced PR_SET_MM_MAP had no
permissions around changing the exe link. They too argued that it is
trivial to spoof the exe link already which is true. The argument
brought up against this was that the Tomoyo LSM uses the exe link in
tomoyo_manager() to detect whether the calling process is a policy
manager. This caused changing the exe links to be guarded by userns
CAP_SYS_ADMIN.
All in all this rather seems like a "better guard it with something
rather than nothing" argument which imho doesn't qualify as a great
security policy. Again, because spoofing the exe link is possible for
the calling process so even if this were security relevant it was
broken back then and would be broken today. So technically, dropping
all permissions around changing the exe link would probably be
possible and would send a clearer message to any userspace that relies
on /proc/self/exe for security reasons that they should stop doing
this but for now we're only relaxing the exe link permissions from
userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.
There's a final uapi change in here. Changing the exe link used to
accidently return EINVAL when the caller lacked the necessary
permissions instead of the more correct EPERM. This pr contains a
commit fixing this. I assume that userspace won't notice or care and
if they do I will revert this commit. But since we are changing the
permissions anyway it seems like a good opportunity to try this fix.
With these changes merged unprivileged checkpoint/restore will be
possible and has already been tested by various users"
[1] LPC 2018
1. "Task Migration at Google Using CRIU"
https://www.youtube.com/watch?v=yI_1cuhoDgA&t=12095
2. "Securely Migrating Untrusted Workloads with CRIU"
https://www.youtube.com/watch?v=yI_1cuhoDgA&t=14400
LPC 2019
1. "CRIU and the PID dance"
https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=2m48s
2. "Update on Task Migration at Google Using CRIU"
https://www.youtube.com/watch?v=LN2CUgp8deo&list=PLVsQ_xZBEyN30ZA3Pc9MZMFzdjwyz26dO&index=9&t=1h2m8s
[2] checkpoint-restore/criu#1155
[3] https://github.com/nviennot/run_as_exe
* tag 'cap-checkpoint-restore-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
selftests: add clone3() CAP_CHECKPOINT_RESTORE test
prctl: exe link permission error changed from -EINVAL to -EPERM
prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
pid: use checkpoint_restore_ns_capable() for set_tid
capabilities: Introduce CAP_CHECKPOINT_RESTORE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
We need solid testing without CAP_SYS_ADMIN, where we run with just with CAP_CHECKPOINT_RESTORE and CAP_PTRACE, as advertised. Otherwise, we can't develop properly. This may require us to use a custom kernel (could be compiled and put as a binary blob in some repo we own) that we run with qemu.
-
CLI options: should we require users to manually disable features (like network), or automatically disable features when we see we don't have the right capability. Another way to look at this is to be tolerant as much as we can to EPERM errors.
-
CRIU should not look at uid. Rather, it should look at its effective capabilities (and perhaps whether it is running in a user namespace, as it makes all the
capable(CAP_*)tests fail in the kernel, which matters for things likeSO_SNDBUFFORCE). -
At some point, we should motivate the docker/kubernetes people to add CAP_CHECKPOINT_RESTORE in the set of bounded capabilities (so users don't have to add it manually).
I can help
criu/cr-check.c
Outdated
| * CAP_CHECKPOINT_RESTORE and CAP_PTRACE. | ||
| */ | ||
|
|
||
| fd = open(criu, O_RDONLY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a good idea to check for the caps of the CRIU binary. Even if the binary has some setcap magic bits on it, we might not have the effective capabilities we want, because the bounding capabilities may be restricted. (For example, in kubernetes, one must specify the capabilities that can be activated within the container in the yaml file, regardless of the setcap attrs of the binaries in the image)
We should look at /proc/self/status and see what effective capabilities we have. Using the capget() syscall is another option. Depending if we have CAP_SYS_ADMIN, CAP_CHECKPOINT_RESTORE, or CAP_NET_ADMIN, we may want to do a variety of things.
Parsing /proc/self/status should be fairly easy, as there is already one parse_pid_status(), which parses the capabilities.
Replacing opts.uid with opts.caps would be helpful: instead of doing tests like if (opts.uid), we could do tests like if (opts.caps & CAP_NET_ADMIN)
XXX We might want to test if we are in a user namespace, and not in the root namespace. If that's the case, then a bunch of tests won't work though. If we are in a user namespace, all the kernel checks that are if (capable(CAP_*)) will return false. I'm not sure how to test if we are in a user_ns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
criu already should have a helper to test whether or not it is running a user namespace. If not, what should work is sm like:
inline static bool in_userns(void)
{
__do_fclose FILE *f = NULL;
uid_t user, host, count;
int ret;
/* Now: are we in a user namespace? Because then we're also
* unprivileged.
*/
f = fopen("/proc/self/uid_map", "re");
if (!f)
return false;
ret = fscanf(f, "%u %u %u", &user, &host, &count);
if (ret != 3)
return false;
return user != 0 || host != 0 || count != UINT32_MAX;
}
criu/cr-check.c
Outdated
| out_close: | ||
| close(fd); | ||
| if (exit_code) { | ||
| pr_msg(" CRIU needs to be run as root or needs to have the CAP_SYS_ADMIN or\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One call instead of three, and pr_msg -> pr_error. Also I wouldn't go with the two space of indent.
so:
pr_error(
"CRIU...\n"
"2nd line...\n"
"3rd line...\n");
| * capabilities available. Now we have to make the process dumpable | ||
| * so that /proc/self is not owned by root. | ||
| */ | ||
| if (pr_set_dumpable(1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With capabilities checks, we could either do set_dumpable() in all cases. If we wanted to put it behind an if, we could check if we have CAP_DAC_OVERRIDE
criu/fdstore.c
Outdated
|
|
||
| if (setsockopt(sk, SOL_SOCKET, SO_SNDBUFFORCE, &buf[0], sizeof(buf[0])) < 0 || | ||
| setsockopt(sk, SOL_SOCKET, SO_RCVBUFFORCE, &buf[1], sizeof(buf[1])) < 0) { | ||
| if (opts.uid) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could check CAP_NET_ADMIN for this one but I'm not sure how this work in practice. The kernel checks if we have CAP_NET_ADMIN in the root namespace. This test can also succeed if we are not in a user namespace. I'm not sure how to test that we are in a user name space
Maybe it's better to try to use SO_SNDBUFFORCE, and if we can't due to -EPERM, we fallback to SO_SNDBUF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we need to check CAP_NET_ADMIN here. uid can be 0 but setsockopt(SO_RCVBUFFORCE) will fail without CAP_NET_ADMIN.
criu/files.c
Outdated
|
|
||
| static int fchroot(int fd) | ||
| { | ||
| if (opts.uid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe test CAP_SYS_CHROOT (here the kernel tests CAP_SYS_CHROOT in the current user namespace).
|
|
||
| static int restore_creds(struct thread_creds_args *args, int procfd, | ||
| int lsm_type) | ||
| int lsm_type, uid_t uid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of things going in there that can fail if we have limited creds: . sys_setresuid() and friends, sys_prctl(PR_CAPBSET_DROP, etc.) / sys_capset(). SELinux things.
I think we need testing without CAP_SYS_ADMIN, and just CAP_CHECKPOINT_RESTORE. Can we boot a kernel that has the new cap on travis ci via qemu? Otherwise, I'm afraid we are not going to be effective in writing this non-root functionality.
Thanks to @brauner the CAP_CHECKPOINT_RESTORE changes have been merged and should be available with RC1 (torvalds/linux@74858ab). As we are already using Vagrant to test the latest Fedora release we can easily add another Vagrant based test and install the latest RC1 kernel via https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories which provides all RC releases:
Any reason to not automatically disable features if we already detect that it will not work?
Sounds good. Like you mentioned above looking at
Thanks. In your review you mentioned a lot of different capabilities we should test and this sounds like a good approach. I would start with reading |
Sometimes, failing can be preferable:
So I'd argue that we want a deterministic behavior given a set of CLI options. I see two approaches: b) is a superset of a) and would be easy to use for users. Most importantly, the behavior of CRIU is deterministic given a list of CLI options. |
criu/cr-check.c
Outdated
| * CAP_CHECKPOINT_RESTORE and CAP_PTRACE. | ||
| */ | ||
|
|
||
| fd = open(criu, O_RDONLY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
criu already should have a helper to test whether or not it is running a user namespace. If not, what should work is sm like:
inline static bool in_userns(void)
{
__do_fclose FILE *f = NULL;
uid_t user, host, count;
int ret;
/* Now: are we in a user namespace? Because then we're also
* unprivileged.
*/
f = fopen("/proc/self/uid_map", "re");
if (!f)
return false;
ret = fscanf(f, "%u %u %u", &user, &host, &count);
if (ret != 3)
return false;
return user != 0 || host != 0 || count != UINT32_MAX;
}|
I created a vagrant based test using a 5.8.0 kernel for now. Once the Fedora vanilla kernel mainline repository picks up 5.9.0-rc1 the test should work. Right now I added the test more as a proof of concept. To see how it could look like. |
criu/kerndat.c
Outdated
| { | ||
| int fd, ret; | ||
|
|
||
| if (opts.uid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a comment why we can't load cache in this case.
a5eda57 to
4cd15dd
Compare
|
Test is now running on Fedora with 5.9.0-rc0 kernel and CAP_CHECKPOINT_RESTORE |
4cd15dd to
6a6f93e
Compare
|
I rebased this and pushed a new version in which I tried to switch to using /proc/pid/status. |
6a6f93e to
e791f7f
Compare
|
Pushed again to fix a CI error with usernsd. |
|
@nviennot Any comments from your side to the current patches? |
|
|
||
| if (opts.uid) | ||
| return; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change will cause pre-dump to fail for non-root users with the following error for processes with larger (e.g., 1GB) memory state.
Error (criu/page-pipe.c:120): page-pipe: Can't make pipe for page-pipe: Too many open files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adrianreber you marked it as resolved, but for me it is unclear how it was resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's about pre-dump which is out of scope of this PR.
|
@adrianreber running |
|
@rst0git What does happen if you run pre-dump as non-root before my patches? I am aware that my patches are only touching a small part of what needs to be done to have the complete code base ready for CAP_CHECKPOINT_RESTORE. But it at least provides a working checkpoint and restore implementation as non-root. I would prefer an incremental approach on the |
@adrianreber I can confirm that the same error occurs before applying the patches in this PR and I agree with you that incremental approach on the |
So, here's the enhanced version of the first try.
Changes are:
1. The wrapper name is criu-ns instead of crns.py
2. The CLI is absolutely the same as for criu, since the script
re-execl-s criu binary. E.g.
scripts/criu-ns dump -t 1234 ...
just works
3. Caller doesn't need to care about substituting CLI options,
instead, the scripts analyzes the command line and
a) replaces -t|--tree argument with virtual pid __if__ the
target task lives in another pidns
b) keeps the current cwd (and root) __if__ switches to another
mntns. A limitation applies here -- cwd path should be the
same in target ns, no "smart path mapping" is performed. So
this script is for now only useful for mntns clones (which
is our main goal at the moment).
Signed-off-by: Pavel Emelyanov <[email protected]>
Looks-good-to: Andrey Vagin <[email protected]>
This adds the function check_caps() which checks if CRIU is running with at least CAP_CHECKPOINT_RESTORE. That is the minimum capability CRIU needs to do a minimal checkpoint and restore from it. In addition helper functions are added to easily query for other capability for enhanced checkpoint/restore support. Signed-off-by: Adrian Reber <[email protected]>
This commit enables checkpointing and restoring of applications as non-root. First goal was to enable checkpoint and restore of the env00 and pthread00 test case. This uses the information from opts.uid and opts.cap_eff to skip certain code paths which do not work as non-root. The kerndat file is saved in $HOME/.criu.kdat. As $HOME is usually not on a tmpfs there is now a check for nodename and release to see if the system has been rebooted since last writing the kerndat cache file and mark it as stale. Signed-off-by: Adrian Reber <[email protected]>
This adds the non-root section and information about the parameter --unprivileged to the man page. Signed-off-by: Adrian Reber <[email protected]>
This are the minimal changes to make zdtm.py successfully run the env00 and pthread test case as non-root using the '--rootless' zdtm option. Signed-off-by: Adrian Reber <[email protected]>
Run env00 and pthread00 test as non-root as initial proof of concept. Signed-off-by: Adrian Reber <[email protected]>
|
Rebased to solve a merge conflict. So what is still missing to merge this? |
| unsigned int n_ctls = 0; | ||
| struct cg_set *cs; | ||
|
|
||
| if (opts.uid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be "if (opts.unprivileged)" ?
| CgroupEntry cg = CGROUP_ENTRY__INIT; | ||
| int ret = -1; | ||
|
|
||
| if (opts.uid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opts.unprivileged and we need to print a warning that cgroups are not dumped.
|
|
||
| he->has_root_cg_set = true; | ||
| if (!opts.uid) | ||
| he->has_root_cg_set = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very very suspicious. We set has_root_cg_set only if uid is zero, but we always call dump_task_cgroup. If here is any reasons for this, we need to write a comment here.
When I suggested to add the --unprivileged options, I mean that we will check opts.unprivileged instead of opts.uid.
| * allow to write to KDAT_RUNDIR which usually is only writable by root. | ||
| * Let's write .criu.kdat file to the home directory for non-root cases. | ||
| */ | ||
| if (opts.uid) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
opts.unprivileged. opts.uid doesn't guarantee that criu has been executed with a full set of capabilities.
|
|
||
| /* | ||
| * For the non-root use case we use uname(2) to decide | ||
| * if a reboot happened. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can uname help up to detect a reboot? Do you mean a kernel update?
|
Hello @avagin, I think it would make more sense if you take over this PR. You have a much clearer understanding and vision about what needs to be done than myself. Just take what already exist and make the necessary changes. You can close this PR or modify it. Whatever works best for you. |
|
A friendly reminder that this PR had no activity for 30 days. |
|
The changes in this PR have been used to enable rootless checkpoint/restore support in OpenLiberty/open-liberty#20683 |
With the almost finished discussion about CAP_CHECKPOINT_RESTORE I reworked my test code to be ready for review.
This contains the changes to run CRIU as non-root with either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE. As Travis does not yet support CAP_CHECKPOINT_RESTORE I am using CAP_SYS_ADMIN during Travis testing, but locally it works just the same with CAP_CHECKPOINT_RESTORE.
This PR checks if the euid is non-zero and if it is non-zero it checks if CRIU has either CAP_SYS_ADMIN or CAP_CHECKPOINT_RESTORE. If that is true CRIU remembers the euid and decides at different places in the code to not do certain things which do not work as non-root.
This is a minimal start of non-root CRIU and it can be used with zdtm to test with env00. As seen in the patch.
This is based on @nviennot and my test branches for the lkml CAP_CHECKPOINT_RESTORE discussions.