Support running the Engine daemon inside a user namespace by hallyn · Pull Request #20902 · moby/moby

hallyn · 2016-03-03T08:02:27Z

Two commits to accomodate things we cannot do when running docker from a container which is in a user namespace. Combined with one more patch to libcontainer itself, this suffices to run docker inside a user namespaced container.

vdemeester · 2016-03-03T08:08:05Z

Hi @hallyn, thanks for contribution. Could you do the following :

Using a better title for the PR "2016 03 02/nest.partial" doesn't really help us to know what it relates to ("Run docker daemon inside a docker namespace enabled engine" could be better).
We do not change anything that is in vendor/ directory is not, changes on those must be done upstream (in opencontainers/runc in your case).
Please use gofmt -s -w on the file changed

09:05:06 ---> Making bundle: validate-gofmt (in bundles/1.11.0-dev/validate-gofmt)
09:05:07 These files are not properly gofmt'd:
09:05:07  - pkg/archive/archive.go

hallyn · 2016-03-03T08:52:11Z

D'oh! the third patch (which edits vendor/) was supposed to be removed and is in fact already a PR against opencontainers/runc. Removing it for real now.

thaJeztah · 2016-03-03T10:03:40Z

ping @estesp for user namespaces

thaJeztah · 2016-03-03T11:15:57Z

Got panics on CI

09:14:51 Error response from daemon: Untar re-exec error: exit status 2: output: panic: runtime error: invalid memory address or nil pointer dereference
09:14:51 [signal 0xb code=0x1 addr=0x88 pc=0x90bcf6]

estesp · 2016-03-03T16:02:57Z

pkg/chrootarchive/archive_unix.go

This is the cause of the test panic as options is still nil at this point.

However, I assume this is actually being passed in via the JSON "options" object and shouldn't be set here anyway, so maybe a stray leftover from testing?

hallyn · 2016-03-03T18:11:47Z

Hm, I don't know what happened there, that doesn't look like my original patch... checking.

estesp · 2016-03-03T21:29:54Z

Last step here is to handle cross-platform issue with the RunningInUserNS:

19:09:09 pkg/archive/archive_unix.go:84: undefined: "github.com/opencontainers/runc/libcontainer/system".RunningInUserNS

Running make cross will reveal the issue (the libcontainer/system/linux.go functions are only compiled on linux, but pkg/archive/archive_unix.go is all Unix-like systems).

I think the cleanest way is a runC PR to have a stub for non-Linux that returns false automatically.

estesp · 2016-03-03T21:36:14Z

@hallyn I created opencontainers/runc#620 to solve this; I'm hoping to vendor in opencontainers/runc very soon for the shared namespace work that is now in libcontainer/nsenter; so hopefully we can get this solved ASAP.

hallyn · 2016-03-04T08:09:43Z

Thanks @estesp ! Do I understand right that I should just wait for #620 to clear?

thaJeztah · 2016-03-04T09:10:01Z

@hallyn you can include the change from opencontainers/runc#620 in your PR by adding the file here https://github.com/docker/docker/tree/master/vendor/src/github.com/opencontainers/runc/libcontainer/system. This will make the "vendor" CI check fail, but the other tests should then complete successfully. Once the RunC PR is merged, and vendored, you can rebase your PR to make the vendor check pass

hallyn · 2016-03-05T04:05:04Z

Thanks, done, will watch for the runc pr merge...

tianon · 2016-03-08T23:09:35Z

pkg/chrootarchive/diff_unix.go

options is nil until the line further down, so this fails with panic: runtime error: invalid memory address or nil pointer dereference 😄

actually, even worse -- options doesn't appear to ever be set to non-nil within this function

? the diff I have here has options.InUserNS = true happening after the unmarshalling

Hmm, @mwhudson managed to trigger a panic here on s390x, so I wonder if there's some edge case where json.Unmarshal doesn't create the struct?

Ah nevermind, that's a different set of changes from v1.10.2...hallyn:v1.10.0.serge.2#diff-4166b9ad558bc9c8d0ff7b01b69c128aR34 causing that one -- carry on! 👍

Quoting Tianon Gravi ([email protected]):

@@ -49,6 +51,10 @@ func applyLayer() {
fatal(err)
}

if inUserns {

options.InUserNS = true

(even still, couldn't this just be options.InUserNS = inUserns? or does setting it regardless like that have other consequences?)

nu, bc we must detect it before the chroot.

Quoting Tianon Gravi ([email protected]):

@@ -49,6 +51,10 @@ func applyLayer() {
fatal(err)
}

if inUserns {

options.InUserNS = true

Ah nevermind, that's a different set of changes from v1.10.2...hallyn:v1.10.0.serge.2#diff-4166b9ad558bc9c8d0ff7b01b69c128aR34 causing that one -- carry on! 👍

yeah i had it wrong originally

fwiw, I triggered this by running "sudo adt-run --unbuilt-tree . --- null". and I just tried the same on arm64 with the same results, so it would probably happen on amd64 too...

Exactly which commit id were you using?

That was my fault -- he was testing on essentially ab8e54b, which is missing 45afdc9af9c300887cfd9952e04f5ba7d500f4d9 (which fixes the panic).

estesp · 2016-03-17T21:46:42Z

@hallyn we need to wait until the containerd integration PR merges (#20662) to get rid of your vendor commit--that PR will update the opencontainers/runc vendor to include the needed changes.

In the meantime, what do you think about logging the "skips" for unsupported file types when tarring/untarring in the archive packages? That way there will be some record if bugs come up in the future about "missing" content?

thaJeztah · 2016-03-21T12:25:45Z

ping @hallyn #20662 was merged, can you rebase and address the comments (if any needs addressing?)

hallyn · 2016-03-21T23:08:13Z

Hi,

rebased and added the log msg @estesp mentioned.

estesp · 2016-04-01T16:51:18Z

code LGTM.

Per comment in the more recent PR re: userns + AUFS, @thaJeztah do you know which doc would be the right place to explain this behavior/availability of running the engine in a userns?

hallyn · 2016-04-28T16:51:45Z

Ah ok if you're using the stock xenial kernel then the patch in the closed merge request I sent earlier should work for you. It'll use the 'nsroot=' field in mountinfo to figure out what it needs. (The nsroot= is not likely to go upstream, hence the closed merge request)

cpuguy83 · 2016-05-11T15:55:45Z

So I still can't get this to work.
Using unshare I get the lchown error. If I disable lchown then I get this overlay error: error creating overlay mount to /var/lib/docker/overlay/0ce08510fd8ca9013970f06c67003d6a0c474b4ac2d5f959c440067430ae665d-init/merged: operation not permitted

hallyn · 2016-05-12T06:06:12Z

@cpuguy83 - can you show the output of 'lxc config show containername --expanded' ?
Can you create a new lxd container with

lxc launch ubuntu;xenial x1 -p default -p docker

and use the docker.io package in the archive, and see whether that also fails for you?

hallyn · 2016-05-26T00:30:44Z

ping?

thaJeztah · 2016-05-30T13:19:54Z

Hi @hallyn, we brought this PR up in our review session last Thursday. Unfortunately, @cpuguy83 didn't find the time to look into this again (and so far didn't manage to get it running properly 😊).

ping @estesp perhaps you have some time to try this?

jessfraz · 2016-06-23T18:12:12Z

There should be a test added for this, I bet you can use unshare, and add it to the daemon suite

jessfraz · 2016-06-23T18:12:36Z

otherwise we will never know if it breaks :)

hallyn · 2016-06-24T15:29:55Z

A test would definately be good. IMO this was a bugfix not a feature so a testcase could be a follow-on, but especially now that it's been months since the patch was written, it would already be nice to have one.

I'm all for someone resubmitting my patch with an added-on test :)

icecrime · 2016-07-28T23:32:48Z

Ping @estesp: halp!

estesp · 2016-07-31T04:32:55Z

I may also try with lxc, but I thought using runc might be a good option given it will be installed in the CI system as one of the build outputs along with the daemon.

Ran into several issues with an ubuntu image that I could mostly get around with the following daemon startup:

PATH=/usr/bin dockerd --oom-score-adjust=0 --iptables=false --bridge=none

However, on docker pull alpine after starting the daemon (with vfs being the only driver that seems happy):

Using default tag: latest
latest: Pulling from library/alpine
e110a4a17941: Extracting [=>                                                 ] 65.54 kB/2.31 MB
ERRO[0013] Error trying v2 registry: failed to register layer: ApplyLayer exit status 1 stdout:  stderr: Error creating mount namespace before pivot: operation not permitted
ERRO[0013] Attempting next endpoint for pull after error: failed to register layer: ApplyLayer exit status 1 stdout:  stderr: Error creatinge110a4a17941: Extracting [==================================================>]  2.31 MB/2.31 MB
failed to register layer: ApplyLayer exit status 1 stdout:  stderr: Error creating mount namespace before pivot: operation not permitted
root@ubuntu:/#

Maybe LXC is set up better for this nesting, but I would like to figure out how to get runc as a valid "candidate" for the outer layer of the nesting if at all possible. Any thoughts @hallyn ?

estesp · 2016-07-31T04:46:36Z

Ah, this is caused by #22506; seems we will need to extend this patch to rely on/fallback to chroot if we are nested in a userns.

estesp · 2016-07-31T05:03:11Z

Getting further after fallback to "real choot" if nested userns, but shm tmpfs mount seems to be a problem; or a real problem is not logged and the cleanup of mounts is just a side effect:

root@ubuntu # docker run -ti --rm alpine sh
WARN[0067] failed to cleanup ipc mounts:
failed to umount /var/lib/docker/containers/3cab03c60610a93231bd95a63ba2ad4f914feec3b1655b082170c59c3318fb6b/shm: operation not permitted
ERRO[0067] Handler for POST /v1.25/containers/3cab03c60610a93231bd95a63ba2ad4f914feec3b1655b082170c59c3318fb6b/start returned error: mounting shm tmpfs: operation not permitted
docker: Error response from daemon: mounting shm tmpfs: operation not permitted.

At this point I have to stop for now, but suggestions welcome. Looks like we are close to a testable scenario with runc, maybe?

thaJeztah · 2016-08-09T12:39:12Z

ping @mlaventure perhaps you could have a look?

estesp · 2016-08-10T21:27:17Z

Here's the real issue; at least running under runc with a fairly vanilla config.json, the /sys mount is not owned by my remapped root, so directory creation for cgroups fails miserably:

root@ubuntu:~# docker run --rm alpine date
ERRO[0005] containerd: start container                   error=oci runtime error: process_linux.go:258: applying cgroup configuration for process caused "mkdir /sys
/fs/cgroup/cpuset/docker: permission denied" id=3fa1606bd39a2f2ac276e47830865e58b6d7f1426012843d91cb08042fd89805
ERRO[0005] Create container failed with error: oci runtime error: process_linux.go:258: applying cgroup configuration for process caused "mkdir /sys/fs/cgroup/cpuse
t/docker: permission denied"

kimh · 2016-08-11T08:27:48Z

I'm trying to catch up this PR because I also want to run docker inside user namespace (unpriv LXC container) and want to reproduce the errors locally first.

@estesp Mind sharing in what env you are using? Specifically, I like to know

are you using docker built from master with the PR change?
on what env you started docker command? (lxc container?)

Also, I'm curious if running the latest docker inside userns becomes possible without changing runc/containerd? Or is it something we'll find out eventually?

estesp · 2016-08-11T20:16:32Z

@kimh it is possible with LXC and this PR plus an additional change that is required since this PR was created (I mentioned it above related to chroot/untar fallback in a comment 12 days ago).

I will be carrying this PR with the additional change, but had been stuck on trying to see if I could get a working test using runc as the outer user namespaced container, but running into trouble as you can see.

I did try via LXC today and with this PR + my added change I can run the daemon with "dockerd -D -s vfs --oom-score-adjust=0" and pull images and run them successfully in an LXC ubuntu:xenial container.

kimh · 2016-08-11T23:11:00Z

@estesp So if my understanding is correct, my added change refers to fallback to chroot under nested userns ? Any chance you can share the change with us? I looked for the branch in your fork but not sure which one it is. I'm thrilled about testing this 💣

estesp · 2016-08-12T20:40:42Z

@kimh please see my carry PR #25672 that has both patches. I'm going to close this PR and hopefully we can get the carried PR merged

redbaron · 2018-04-01T11:00:10Z

daemon/graphdriver/overlay/copy.go

 		case os.ModeNamedPipe:
 			fallthrough
 		case os.ModeSocket:
+			if rsystem.RunningInUserNS() {


this seems be a typo. according to commit message, it should skip in os.ModeDevice case, not in os.ModeSocket

This change was introduced early in the development of rootless support, before all the kinks were worked out and rootlesskit was built. The author was testing the daemon by inside a user namespace set up by runc, observed that the unshare(2) syscall was returning EPERM, and assumed that it was a fundamental limitation of user namespaces. Seeing as the kernel documentation (of today) disagrees with that assessment and that unshare demonstrably works inside user namespaces, I can only assume that the EPERM was due to a quirk of their test environment, such as a seccomp filter set up by runc blocking the unshare syscall. moby#20902 (comment) Mount namespaces are necessary to address moby#38995 and moby#43390. Revert the special-casing so those issues can also be fixed for rootless daemons. This reverts commit dc95056. Signed-off-by: Cory Snider <[email protected]>

This change was introduced early in the development of rootless support, before all the kinks were worked out and rootlesskit was built. The author was testing the daemon by inside a user namespace set up by runc, observed that the unshare(2) syscall was returning EPERM, and assumed that it was a fundamental limitation of user namespaces. Seeing as the kernel documentation (of today) disagrees with that assessment and that unshare demonstrably works inside user namespaces, I can only assume that the EPERM was due to a quirk of their test environment, such as a seccomp filter set up by runc blocking the unshare syscall. moby/moby#20902 (comment) Mount namespaces are necessary to address #38995 and #43390. Revert the special-casing so those issues can also be fixed for rootless daemons. This reverts commit 047305a. Signed-off-by: Cory Snider <[email protected]>

GordonTheTurtle added the status/0-triage label Mar 3, 2016

vdemeester added status/1-design-review and removed status/0-triage labels Mar 3, 2016

hallyn force-pushed the 2016-03-02/nest.partial branch from 6dda0d6 to 7101e12 Compare March 3, 2016 08:52

hallyn changed the title ~~2016 03 02/nest.partial~~ Support running inside a user namespace Mar 3, 2016

hallyn force-pushed the 2016-03-02/nest.partial branch from 7101e12 to 412c77a Compare March 3, 2016 08:54

thaJeztah added the status/failing-ci Indicates that the PR in its current state fails the test suite label Mar 3, 2016

estesp reviewed Mar 3, 2016
View reviewed changes

estesp mentioned this pull request Mar 3, 2016

Stub RunningInUserNS for non-Linux opencontainers/runc#620

Merged

tianon reviewed Mar 8, 2016
View reviewed changes

calavera changed the title ~~Support running inside a user namespace~~ Support running the Engine daemon inside a user namespace Mar 14, 2016

hallyn force-pushed the 2016-03-02/nest.partial branch 2 times, most recently from bdb2d33 to a32e59c Compare March 21, 2016 22:57

thaJeztah added status/2-code-review and removed status/failing-ci Indicates that the PR in its current state fails the test suite status/1-design-review labels Mar 31, 2016

thaJeztah mentioned this pull request Jun 9, 2016

docker 1.11.2 inside LXC + userns-remap #23405

Closed

vdemeester mentioned this pull request Aug 3, 2016

test-integration-cli for userns don not use DockerDaemonSuite #25355

Closed

estesp mentioned this pull request Aug 12, 2016

Allow engine to run inside a user namespace #25672

Merged

estesp closed this Aug 12, 2016

thaJeztah mentioned this pull request Mar 19, 2018

Creating FIFO Files in user namespace? #36622

Closed

redbaron reviewed Apr 1, 2018

View reviewed changes

corhere mentioned this pull request Oct 4, 2022

[DO NOT MERGE] Test if pivot_root can be used from a user ns #44240

Closed

Conversation

hallyn commented Mar 3, 2016

Uh oh!

vdemeester commented Mar 3, 2016

Uh oh!

hallyn commented Mar 3, 2016

Uh oh!

thaJeztah commented Mar 3, 2016

Uh oh!

thaJeztah commented Mar 3, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hallyn commented Mar 3, 2016

Uh oh!

estesp commented Mar 3, 2016

Uh oh!

estesp commented Mar 3, 2016

Uh oh!

hallyn commented Mar 4, 2016

Uh oh!

thaJeztah commented Mar 4, 2016

Uh oh!

hallyn commented Mar 5, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

estesp commented Mar 17, 2016

Uh oh!

thaJeztah commented Mar 21, 2016

Uh oh!

hallyn commented Mar 21, 2016

Uh oh!

estesp commented Apr 1, 2016

Uh oh!

hallyn commented Apr 28, 2016 via email

Uh oh!

cpuguy83 commented May 11, 2016

Uh oh!

hallyn commented May 12, 2016

Uh oh!

hallyn commented May 26, 2016

Uh oh!

thaJeztah commented May 30, 2016

Uh oh!

jessfraz commented Jun 23, 2016

Uh oh!

jessfraz commented Jun 23, 2016

Uh oh!

hallyn commented Jun 24, 2016

Uh oh!

icecrime commented Jul 28, 2016

Uh oh!

estesp commented Jul 31, 2016

Uh oh!

estesp commented Jul 31, 2016

Uh oh!

estesp commented Jul 31, 2016

Uh oh!

thaJeztah commented Aug 9, 2016

Uh oh!

estesp commented Aug 10, 2016

kimh commented Aug 11, 2016 •

edited

Loading