Skip to content

/dev should not be readonly with --readonly flag#35344

Merged
thaJeztah merged 1 commit into
moby:masterfrom
rhatdan:readonly-/dev
Nov 3, 2017
Merged

/dev should not be readonly with --readonly flag#35344
thaJeztah merged 1 commit into
moby:masterfrom
rhatdan:readonly-/dev

Conversation

@rhatdan
Copy link
Copy Markdown
Contributor

@rhatdan rhatdan commented Oct 31, 2017

/dev is mounted on a tmpfs inside of a container. Processes inside of containers
some times need to create devices nodes, or to setup a socket that listens on /dev/log
Allowing these containers to run with the --readonly flag makes sense. Making a tmpfs
readonly does not add any security to the container, since there is plenty of places
where the container can write tmpfs content.

I have no idea why /dev was excluded.

fixes #35041

Signed-off-by: Daniel J Walsh [email protected]

- What I did

- How I did it

- How to verify it

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@thaJeztah
Copy link
Copy Markdown
Member

ping @tonistiigi @justincormack @kolyshkin PTAL

@justincormack
Copy link
Copy Markdown
Contributor

There are not plenty of places where there is a writeable tmpfs, currently there is only one that is not yet removeable (/dev/mqueue which acts as a tmpfs). Since we added --ipc=none support recently we removed the second to last one, so we are close to being able to have totally unwriteable containers. So I am not in favour of always making /dev writeable always. I don't mind adding an option to make it writeable, or even the reverse, so long as there is a way to make it not writeable.

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Oct 31, 2017

/dev/shm?

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Oct 31, 2017

What is the goal of --readonly. I see it as preventing a process from making changes to the image. Not to where it can not write on any temporary file system.

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Oct 31, 2017

       --read-only=true|false
          Mount the container's root filesystem as read only.
docker run --help  | grep root-readonly
      --read-only                   Mount the container's root filesystem as read only

Man page and help indicate that the "root" file system is readonly not all file systems inside of the image.

@justincormack
Copy link
Copy Markdown
Contributor

--ipc=none removes /dev/shm. I think the wording is a bit unclear, and while yes in theory it was the rootfs, which corresponds exactly to the runtime spec config, it always affected some other filesystems. I would be happy if these were affected by other flags instead.

@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Oct 31, 2017

I guess I don't understand the goal of --read-only flag then. The way I would describe it, is the flag gives a security feature where a container can not modify the image used for it, and just restarting the container would remove any content created by processes running inside the container, except for content volume mounted into the container.

If your definition is that --read-only means that the processes can not write to anywhere inside of the container except for volumes mounted into the container, then you are failing at that by overriding

"/proc", "/dev/pts", "/dev/mqueue":

IE Processes would be able to write to /dev/pts, and /dev/mqueue, as well as able to write to /proc.

# docker run --read-only -ti fedora mount | grep rw
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime) devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,context="system_u:object_r:container_file_t:s0:c163,c465",gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime,seclabel)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:container_file_t:s0:c163,c465",size=65536k)
devpts on /dev/console type devpts (rw,nosuid,noexec,relatime,seclabel,gid=5,mode=620,ptmxmode=000)

By either definition --read-only is broken.

From a security point of view, the security goal is that by default a hacked container can not write a backdoor so that if the container was restarted it would be already compromised.

I actually believe all containers in production should run with --read-only rootfs, with /run /var/tmp, /tmp with tmpfs mounted and writing to /dev/*, Buf if users need to do lots of toggling of flags to make tmpfs work, then --read-only will hardly ever being used, which means we miss an opportunity to make containers more secure.

@thaJeztah
Copy link
Copy Markdown
Member

ping @n4ss as well (for entitlements)

@justincormack
Copy link
Copy Markdown
Contributor

You cant actually write stuff to /dev/pts as it is a proper pseudo filesystem that only accepts pty nodes not arbitrary files.

However, having discovered that you can create a file using memfd_create and then use execveat to execute it, I hereby give up on not providing writeable tmpfs to containers, so I guess this change is ok...

/dev is mounted on a tmpfs inside of a container.  Processes inside of containers
some times need to create devices nodes, or to setup a socket that listens on /dev/log
Allowing these containers to run with the --readonly flag makes sense.  Making a tmpfs
readonly does not add any security to the container, since there is plenty of places
where the container can write tmpfs content.

I have no idea why /dev was excluded.

Signed-off-by: Daniel J Walsh <[email protected]>
@rhatdan
Copy link
Copy Markdown
Contributor Author

rhatdan commented Nov 2, 2017

@thaJeztah Tests pass who else do I need to get approval from to get this merged?

Copy link
Copy Markdown
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thaJeztah
Copy link
Copy Markdown
Member

ping @justincormack @kolyshkin PTAL

@kolyshkin
Copy link
Copy Markdown
Contributor

LGTM

I admit I also do not entirely understand why --read-only is not specific to container's rootfs only. Was it done as a way to prevent creating/running a new binary?

However, having discovered that you can create a file using memfd_create and then use execveat to
execute it, I hereby give up on not providing writeable tmpfs to containers, so I guess this change is ok...

Can seccomp be used to disallow memfd_create?

@kolyshkin
Copy link
Copy Markdown
Contributor

Also, /dev is (hopefully) mounted with noexec so you can't really run a binary from it.

@kolyshkin
Copy link
Copy Markdown
Contributor

Oops, looks like it's not

kir@kd:~$ docker run -it alpine mount | grep 'on /dev '
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)

Unless there's something we need to run from /dev, I think this "no-noexec" should be fixed before merging this PR.

@n4ss
Copy link
Copy Markdown

n4ss commented Nov 3, 2017

@justincormack would it makes sense to do the following as part a fs.read-only entitlement?

  • prevent memfd_create syscall
  • remove /dev/shm

I get the feeling that the only real read-only is going to come from IMA+EVM..

I agree with @kolyshkin, let's make sure /dev has the most restrictive mount option besides ro (even as part of this PR).

@justincormack
Copy link
Copy Markdown
Contributor

well, memfd_create is really useful and I have lots of code that uses it...

I did think /dev was mounted noexec I will have to check if it changed at some point.

@justincormack
Copy link
Copy Markdown
Contributor

I think we can change `noexec`` in another pr, it is unrelated.

@thaJeztah
Copy link
Copy Markdown
Member

I agree we can do the other changes in a follow-up

@n4ss @kolyshkin could one of you make a pull request or open an issue to track that effort?

@thaJeztah thaJeztah merged commit 7d8affa into moby:master Nov 3, 2017
@n4ss
Copy link
Copy Markdown

n4ss commented Nov 3, 2017

@thaJeztah #35397

justincormack added a commit to justincormack/cri-containerd that referenced this pull request Nov 6, 2017
…root

Since moby/moby#35344 we clarified that this behaviour
was a mistake, and the read only flag should just apply to the actual rootfs,
so it corresponds to the OCI read-only option. Other mounts may be able to be
adjusted by re-specifying them or other means but this is unrelated.

Signed-off-by: Justin Cormack <[email protected]>
Graee pushed a commit to Graee/cri-containerd that referenced this pull request Jan 5, 2018
…root

Since moby/moby#35344 we clarified that this behaviour
was a mistake, and the read only flag should just apply to the actual rootfs,
so it corresponds to the OCI read-only option. Other mounts may be able to be
adjusted by re-specifying them or other means but this is unrelated.

Signed-off-by: Justin Cormack <[email protected]>
justincormack added a commit to justincormack/docker that referenced this pull request Apr 6, 2018
It does not make any sense to vary this based on whether the
rootfs is read only. We removed all the other mount dependencies
on read-only eg see moby#35344.

Signed-off-by: Justin Cormack <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ReadOnlyRootFs causes /dev tmpfs to be mounted in ro mode

7 participants