Skip to content

Support running the Engine daemon inside a user namespace#20902

Closed
hallyn wants to merge 1 commit intomoby:masterfrom
hallyn:2016-03-02/nest.partial
Closed

Support running the Engine daemon inside a user namespace#20902
hallyn wants to merge 1 commit intomoby:masterfrom
hallyn:2016-03-02/nest.partial

Conversation

@hallyn
Copy link
Contributor

@hallyn hallyn commented Mar 3, 2016

Two commits to accomodate things we cannot do when running docker from a container which is in a user namespace. Combined with one more patch to libcontainer itself, this suffices to run docker inside a user namespaced container.

@vdemeester
Copy link
Member

Hi @hallyn, thanks for contribution. Could you do the following :

  • Using a better title for the PR "2016 03 02/nest.partial" doesn't really help us to know what it relates to ("Run docker daemon inside a docker namespace enabled engine" could be better).
  • We do not change anything that is in vendor/ directory is not, changes on those must be done upstream (in opencontainers/runc in your case).
  • Please use gofmt -s -w on the file changed
09:05:06 ---> Making bundle: validate-gofmt (in bundles/1.11.0-dev/validate-gofmt)
09:05:07 These files are not properly gofmt'd:
09:05:07  - pkg/archive/archive.go

@hallyn
Copy link
Contributor Author

hallyn commented Mar 3, 2016

D'oh! the third patch (which edits vendor/) was supposed to be removed and is in fact already a PR against opencontainers/runc. Removing it for real now.

@hallyn hallyn force-pushed the 2016-03-02/nest.partial branch from 6dda0d6 to 7101e12 Compare March 3, 2016 08:52
@hallyn hallyn changed the title 2016 03 02/nest.partial Support running inside a user namespace Mar 3, 2016
@hallyn hallyn force-pushed the 2016-03-02/nest.partial branch from 7101e12 to 412c77a Compare March 3, 2016 08:54
@thaJeztah
Copy link
Member

ping @estesp for user namespaces

@thaJeztah
Copy link
Member

Got panics on CI

09:14:51 Error response from daemon: Untar re-exec error: exit status 2: output: panic: runtime error: invalid memory address or nil pointer dereference
09:14:51 [signal 0xb code=0x1 addr=0x88 pc=0x90bcf6]

@thaJeztah thaJeztah added the status/failing-ci Indicates that the PR in its current state fails the test suite label Mar 3, 2016
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the cause of the test panic as options is still nil at this point.

However, I assume this is actually being passed in via the JSON "options" object and shouldn't be set here anyway, so maybe a stray leftover from testing?

@hallyn
Copy link
Contributor Author

hallyn commented Mar 3, 2016

Hm, I don't know what happened there, that doesn't look like my original patch... checking.

@estesp
Copy link
Contributor

estesp commented Mar 3, 2016

Last step here is to handle cross-platform issue with the RunningInUserNS:

19:09:09 pkg/archive/archive_unix.go:84: undefined: "github.com/opencontainers/runc/libcontainer/system".RunningInUserNS

Running make cross will reveal the issue (the libcontainer/system/linux.go functions are only compiled on linux, but pkg/archive/archive_unix.go is all Unix-like systems).

I think the cleanest way is a runC PR to have a stub for non-Linux that returns false automatically.

@estesp
Copy link
Contributor

estesp commented Mar 3, 2016

@hallyn I created opencontainers/runc#620 to solve this; I'm hoping to vendor in opencontainers/runc very soon for the shared namespace work that is now in libcontainer/nsenter; so hopefully we can get this solved ASAP.

@hallyn
Copy link
Contributor Author

hallyn commented Mar 4, 2016

Thanks @estesp ! Do I understand right that I should just wait for #620 to clear?

@thaJeztah
Copy link
Member

@hallyn you can include the change from opencontainers/runc#620 in your PR by adding the file here https://github.com/docker/docker/tree/master/vendor/src/github.com/opencontainers/runc/libcontainer/system. This will make the "vendor" CI check fail, but the other tests should then complete successfully. Once the RunC PR is merged, and vendored, you can rebase your PR to make the vendor check pass

@hallyn
Copy link
Contributor Author

hallyn commented Mar 5, 2016

Thanks, done, will watch for the runc pr merge...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

options is nil until the line further down, so this fails with panic: runtime error: invalid memory address or nil pointer dereference 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, even worse -- options doesn't appear to ever be set to non-nil within this function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

? the diff I have here has options.InUserNS = true happening after the unmarshalling

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, @mwhudson managed to trigger a panic here on s390x, so I wonder if there's some edge case where json.Unmarshal doesn't create the struct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nevermind, that's a different set of changes from v1.10.2...hallyn:v1.10.0.serge.2#diff-4166b9ad558bc9c8d0ff7b01b69c128aR34 causing that one -- carry on! 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting Tianon Gravi ([email protected]):

@@ -49,6 +51,10 @@ func applyLayer() {
fatal(err)
}

  • if inUserns {
  •   options.InUserNS = true
    

(even still, couldn't this just be options.InUserNS = inUserns? or does setting it regardless like that have other consequences?)

nu, bc we must detect it before the chroot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting Tianon Gravi ([email protected]):

@@ -49,6 +51,10 @@ func applyLayer() {
fatal(err)
}

  • if inUserns {
  •   options.InUserNS = true
    

Ah nevermind, that's a different set of changes from v1.10.2...hallyn:v1.10.0.serge.2#diff-4166b9ad558bc9c8d0ff7b01b69c128aR34 causing that one -- carry on! 👍

yeah i had it wrong originally

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, I triggered this by running "sudo adt-run --unbuilt-tree . --- null". and I just tried the same on arm64 with the same results, so it would probably happen on amd64 too...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly which commit id were you using?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my fault -- he was testing on essentially ab8e54b, which is missing 45afdc9af9c300887cfd9952e04f5ba7d500f4d9 (which fixes the panic).

@calavera calavera changed the title Support running inside a user namespace Support running the Engine daemon inside a user namespace Mar 14, 2016
@estesp
Copy link
Contributor

estesp commented Mar 17, 2016

@hallyn we need to wait until the containerd integration PR merges (#20662) to get rid of your vendor commit--that PR will update the opencontainers/runc vendor to include the needed changes.

In the meantime, what do you think about logging the "skips" for unsupported file types when tarring/untarring in the archive packages? That way there will be some record if bugs come up in the future about "missing" content?

@thaJeztah
Copy link
Member

ping @hallyn #20662 was merged, can you rebase and address the comments (if any needs addressing?)

@hallyn hallyn force-pushed the 2016-03-02/nest.partial branch 2 times, most recently from bdb2d33 to a32e59c Compare March 21, 2016 22:57
@hallyn
Copy link
Contributor Author

hallyn commented Mar 21, 2016

Hi,

rebased and added the log msg @estesp mentioned.

@thaJeztah thaJeztah added status/2-code-review and removed status/failing-ci Indicates that the PR in its current state fails the test suite status/1-design-review labels Mar 31, 2016
@estesp
Copy link
Contributor

estesp commented Apr 1, 2016

code LGTM.

Per comment in the more recent PR re: userns + AUFS, @thaJeztah do you know which doc would be the right place to explain this behavior/availability of running the engine in a userns?

@hallyn
Copy link
Contributor Author

hallyn commented Apr 28, 2016 via email

@cpuguy83
Copy link
Member

So I still can't get this to work.
Using unshare I get the lchown error. If I disable lchown then I get this overlay error: error creating overlay mount to /var/lib/docker/overlay/0ce08510fd8ca9013970f06c67003d6a0c474b4ac2d5f959c440067430ae665d-init/merged: operation not permitted

@hallyn
Copy link
Contributor Author

hallyn commented May 12, 2016

@cpuguy83 - can you show the output of 'lxc config show containername --expanded' ?
Can you create a new lxd container with

lxc launch ubuntu;xenial x1 -p default -p docker

and use the docker.io package in the archive, and see whether that also fails for you?

@hallyn
Copy link
Contributor Author

hallyn commented May 26, 2016

ping?

@thaJeztah
Copy link
Member

Hi @hallyn, we brought this PR up in our review session last Thursday. Unfortunately, @cpuguy83 didn't find the time to look into this again (and so far didn't manage to get it running properly 😊).

ping @estesp perhaps you have some time to try this?

@jessfraz
Copy link
Contributor

There should be a test added for this, I bet you can use unshare, and add it to the daemon suite

@jessfraz
Copy link
Contributor

otherwise we will never know if it breaks :)

@hallyn
Copy link
Contributor Author

hallyn commented Jun 24, 2016

A test would definately be good. IMO this was a bugfix not a feature so a testcase could be a follow-on, but especially now that it's been months since the patch was written, it would already be nice to have one.

I'm all for someone resubmitting my patch with an added-on test :)

@icecrime
Copy link
Contributor

Ping @estesp: halp!

@estesp
Copy link
Contributor

estesp commented Jul 31, 2016

I may also try with lxc, but I thought using runc might be a good option given it will be installed in the CI system as one of the build outputs along with the daemon.

Ran into several issues with an ubuntu image that I could mostly get around with the following daemon startup:

PATH=/usr/bin dockerd --oom-score-adjust=0 --iptables=false --bridge=none

However, on docker pull alpine after starting the daemon (with vfs being the only driver that seems happy):

Using default tag: latest
latest: Pulling from library/alpine
e110a4a17941: Extracting [=>                                                 ] 65.54 kB/2.31 MB
ERRO[0013] Error trying v2 registry: failed to register layer: ApplyLayer exit status 1 stdout:  stderr: Error creating mount namespace before pivot: operation not permitted
ERRO[0013] Attempting next endpoint for pull after error: failed to register layer: ApplyLayer exit status 1 stdout:  stderr: Error creatinge110a4a17941: Extracting [==================================================>]  2.31 MB/2.31 MB
failed to register layer: ApplyLayer exit status 1 stdout:  stderr: Error creating mount namespace before pivot: operation not permitted
root@ubuntu:/#

Maybe LXC is set up better for this nesting, but I would like to figure out how to get runc as a valid "candidate" for the outer layer of the nesting if at all possible. Any thoughts @hallyn ?

@estesp
Copy link
Contributor

estesp commented Jul 31, 2016

Ah, this is caused by #22506; seems we will need to extend this patch to rely on/fallback to chroot if we are nested in a userns.

@estesp
Copy link
Contributor

estesp commented Jul 31, 2016

Getting further after fallback to "real choot" if nested userns, but shm tmpfs mount seems to be a problem; or a real problem is not logged and the cleanup of mounts is just a side effect:

root@ubuntu # docker run -ti --rm alpine sh
WARN[0067] failed to cleanup ipc mounts:
failed to umount /var/lib/docker/containers/3cab03c60610a93231bd95a63ba2ad4f914feec3b1655b082170c59c3318fb6b/shm: operation not permitted
ERRO[0067] Handler for POST /v1.25/containers/3cab03c60610a93231bd95a63ba2ad4f914feec3b1655b082170c59c3318fb6b/start returned error: mounting shm tmpfs: operation not permitted
docker: Error response from daemon: mounting shm tmpfs: operation not permitted.

At this point I have to stop for now, but suggestions welcome. Looks like we are close to a testable scenario with runc, maybe?

@thaJeztah
Copy link
Member

ping @mlaventure perhaps you could have a look?

@estesp
Copy link
Contributor

estesp commented Aug 10, 2016

Here's the real issue; at least running under runc with a fairly vanilla config.json, the /sys mount is not owned by my remapped root, so directory creation for cgroups fails miserably:

root@ubuntu:~# docker run --rm alpine date
ERRO[0005] containerd: start container                   error=oci runtime error: process_linux.go:258: applying cgroup configuration for process caused "mkdir /sys
/fs/cgroup/cpuset/docker: permission denied" id=3fa1606bd39a2f2ac276e47830865e58b6d7f1426012843d91cb08042fd89805
ERRO[0005] Create container failed with error: oci runtime error: process_linux.go:258: applying cgroup configuration for process caused "mkdir /sys/fs/cgroup/cpuse
t/docker: permission denied"

@kimh
Copy link

kimh commented Aug 11, 2016

I'm trying to catch up this PR because I also want to run docker inside user namespace (unpriv LXC container) and want to reproduce the errors locally first.

@estesp Mind sharing in what env you are using? Specifically, I like to know

  • are you using docker built from master with the PR change?
  • on what env you started docker command? (lxc container?)

Also, I'm curious if running the latest docker inside userns becomes possible without changing runc/containerd? Or is it something we'll find out eventually?

@estesp
Copy link
Contributor

estesp commented Aug 11, 2016

@kimh it is possible with LXC and this PR plus an additional change that is required since this PR was created (I mentioned it above related to chroot/untar fallback in a comment 12 days ago).

I will be carrying this PR with the additional change, but had been stuck on trying to see if I could get a working test using runc as the outer user namespaced container, but running into trouble as you can see.

I did try via LXC today and with this PR + my added change I can run the daemon with "dockerd -D -s vfs --oom-score-adjust=0" and pull images and run them successfully in an LXC ubuntu:xenial container.

@kimh
Copy link

kimh commented Aug 11, 2016

@estesp So if my understanding is correct, my added change refers to fallback to chroot under nested userns ? Any chance you can share the change with us? I looked for the branch in your fork but not sure which one it is. I'm thrilled about testing this 💣

@estesp
Copy link
Contributor

estesp commented Aug 12, 2016

@kimh please see my carry PR #25672 that has both patches. I'm going to close this PR and hopefully we can get the carried PR merged

case os.ModeNamedPipe:
fallthrough
case os.ModeSocket:
if rsystem.RunningInUserNS() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems be a typo. according to commit message, it should skip in os.ModeDevice case, not in os.ModeSocket

corhere added a commit to corhere/moby that referenced this pull request Oct 13, 2022
This change was introduced early in the development of rootless support,
before all the kinks were worked out and rootlesskit was built. The
author was testing the daemon by inside a user namespace set up by runc,
observed that the unshare(2) syscall was returning EPERM, and assumed
that it was a fundamental limitation of user namespaces. Seeing as the
kernel documentation (of today) disagrees with that assessment and that
unshare demonstrably works inside user namespaces, I can only assume
that the EPERM was due to a quirk of their test environment, such as a
seccomp filter set up by runc blocking the unshare syscall.
moby#20902 (comment)

Mount namespaces are necessary to address moby#38995 and moby#43390. Revert the
special-casing so those issues can also be fixed for rootless daemons.

This reverts commit dc95056.

Signed-off-by: Cory Snider <[email protected]>
corhere added a commit to corhere/moby that referenced this pull request Oct 14, 2022
This change was introduced early in the development of rootless support,
before all the kinks were worked out and rootlesskit was built. The
author was testing the daemon by inside a user namespace set up by runc,
observed that the unshare(2) syscall was returning EPERM, and assumed
that it was a fundamental limitation of user namespaces. Seeing as the
kernel documentation (of today) disagrees with that assessment and that
unshare demonstrably works inside user namespaces, I can only assume
that the EPERM was due to a quirk of their test environment, such as a
seccomp filter set up by runc blocking the unshare syscall.
moby#20902 (comment)

Mount namespaces are necessary to address moby#38995 and moby#43390. Revert the
special-casing so those issues can also be fixed for rootless daemons.

This reverts commit dc95056.

Signed-off-by: Cory Snider <[email protected]>
corhere added a commit to corhere/moby that referenced this pull request Oct 21, 2022
This change was introduced early in the development of rootless support,
before all the kinks were worked out and rootlesskit was built. The
author was testing the daemon by inside a user namespace set up by runc,
observed that the unshare(2) syscall was returning EPERM, and assumed
that it was a fundamental limitation of user namespaces. Seeing as the
kernel documentation (of today) disagrees with that assessment and that
unshare demonstrably works inside user namespaces, I can only assume
that the EPERM was due to a quirk of their test environment, such as a
seccomp filter set up by runc blocking the unshare syscall.
moby#20902 (comment)

Mount namespaces are necessary to address moby#38995 and moby#43390. Revert the
special-casing so those issues can also be fixed for rootless daemons.

This reverts commit dc95056.

Signed-off-by: Cory Snider <[email protected]>
corhere added a commit to corhere/moby that referenced this pull request Oct 26, 2022
This change was introduced early in the development of rootless support,
before all the kinks were worked out and rootlesskit was built. The
author was testing the daemon by inside a user namespace set up by runc,
observed that the unshare(2) syscall was returning EPERM, and assumed
that it was a fundamental limitation of user namespaces. Seeing as the
kernel documentation (of today) disagrees with that assessment and that
unshare demonstrably works inside user namespaces, I can only assume
that the EPERM was due to a quirk of their test environment, such as a
seccomp filter set up by runc blocking the unshare syscall.
moby#20902 (comment)

Mount namespaces are necessary to address moby#38995 and moby#43390. Revert the
special-casing so those issues can also be fixed for rootless daemons.

This reverts commit dc95056.

Signed-off-by: Cory Snider <[email protected]>
dmcgowan pushed a commit to moby/go-archive that referenced this pull request Apr 3, 2025
This change was introduced early in the development of rootless support,
before all the kinks were worked out and rootlesskit was built. The
author was testing the daemon by inside a user namespace set up by runc,
observed that the unshare(2) syscall was returning EPERM, and assumed
that it was a fundamental limitation of user namespaces. Seeing as the
kernel documentation (of today) disagrees with that assessment and that
unshare demonstrably works inside user namespaces, I can only assume
that the EPERM was due to a quirk of their test environment, such as a
seccomp filter set up by runc blocking the unshare syscall.
moby/moby#20902 (comment)

Mount namespaces are necessary to address #38995 and #43390. Revert the
special-casing so those issues can also be fixed for rootless daemons.

This reverts commit 047305a.

Signed-off-by: Cory Snider <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.