Skip to content

Conversation

@bartier
Copy link
Contributor

@bartier bartier commented Jun 18, 2023

closes #45518

Hi there, this is my first PR here, so please feel free to point me out if anything is wrong with this contribution. I looked the #45518 issue and I believe this could be a fix.

- What I did
Removed the function from the filtered syscalls as name_to_handle_at(2) is in fact innocuous and safe

- How I did it
@neersighted help at comment

- How to verify it
N/A

- Description for the changelog

Remove name_to_handle_at(2) from filtered syscalls

- A picture of a cute animal (not mandatory but encouraged)
image

@thaJeztah
Copy link
Member

Thanks for contributing! I see nobody posted here yet, but @neersighted has been digging into history (as well as Linux kernel man-pages and source-code) to verify these changes. I think some further digging was still needed, as there were some (very subtle?) cases where this could cause issues (stay tuned 👍)

(Code changes themselves look good 👍, but it may need a rebase to make the docker-py tests pass (which had an issue that has been fixed on master), but we can ignore those for the time being)

Copy link
Member

@neersighted neersighted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sending this patch!

Since this load-bearing (lots of useful software, e.g. systemd wants this syscall), (potentially) security critical, and has been in this state for a very long time, I did some extensive digging.

First, some background on what name_to_handle_at(2) and open_by_handle_at(2) actually do: this is more-or-less a two-stage openat(2), as described in the manpage.

name_to_handle_at(2) performs a filesystem lookup, but instead of opening a file, it copies a (opaque) struct file_handle to userspace. That handle contains a string, which is looked up in the kernel as part of the system call.

open_by_handle_at(2) can then be used to open the path described by that file handle (which is dereferenced similarly to how it was saved), without doing a filesystem lookup a second time. Note that this does not defend against TOCTOU issues (e.g. the filesystem changing between calls); it merely allows one to pay the cost of resolving a path in the filesystem once, and amortize that cost over many usages.

This appears to have been added to the kernel to better support userspace network filesystem servers (which open and close many files and thus need all the performance shortcuts they can get), and the kernel implementation actually builds on the infrastructure of the exportfs subsystem used for the in-kernel NFS implementation.

The curious among us might find the following parts of the kernel source interesting:

Nonetheless, there are some differences from openat(2): open_by_handle_at(2) requires capabilities(7); specifically, CAP_DAC_READ_SEARCH, but openat(2) requires no privilege. Why is that?

First, let's establish what does not make open_by_handle_at(2) dangerous:

  • open_by_handle_at(2) does pass a opaque and kernel-provided data structure back in, over the kernel/userspace boundary. However, the kernel is quite resilient to potentially hostile inputs, and in this specific case, struct file_handle is just structured data and is safely deserialized -- the kernel never tries to use what it gets back from userspace as a normal internal data structure.
  • Permissions are checked (the thing that the related CAP_DAC_OVERRIDE capability allows bypassing) during open_by_handle_at(2), in the common do_open used by the entire open(2) family of system calls, so there is no risk of a stored struct file_handle being used to bypass (Unix) permissions.

If there is no risk of attacking the kernel, or bypassing DAC, what is the risk of this system call then? Well, if the user can construct arbitrary struct file_handles (hint: they can) and pass them to the kernel, there is the risk they manage to forge a handle and open a path that they otherwise could not access. Note that 'access' is not in the Unix permissions sense -- I use 'access' here to mean "open a dentry that cannot otherwise be resolved using the visible filesystem."

It's notable that there are several situations where filesystem access is not limited by permissions (e.g. a program has root), but instead by filesystem visibility:

  • chroot(2)
  • mount namespaces (including pivot_root(2)), which are used in containers (more on this later)

Indeed, if we look at the history of the patch that introduced these system calls:

  • The initial version of the patch required CAP_SYS_ADMIN (infamously, 'the new root').
  • The next version of the patch moved to CAP_DAC_OVERRIDE.
    • Presumably the logic here was that forging a handle was similar to 'a process that is trusted to ignore Unix permissions'.
  • Subsequent versions decreased the capability to CAP_DAC_READ_SEARCH.
    • As the only risk here is resolving a path that you otherwise could not, 'read' + 'search' (resolve path) are presumably a more granular fit.
    • The patch does note that CAP_DAC_SEARCH would be preferred if it existed, as it would more accurately reflect the threat model.
  • The final, merged version of the patch made no changes to the permissions/capabilities model.

Given the above context, it is very clear that name_to_handle_at(2) is a safe operation (all it does is return a value that is useless without its sister syscall), and open_by_handle_at(2) is the dangerous component. So why is name_to_handle_at(2) more restricted?

Before we address that point directly, it helps to think about how Linux containers work (and how they isolate filesystem access). As noted above, we use mount namespaces(7) as part of the containerization. In fact, mount namespaces are one of the most critical forms of isolation making up a container; you can have root inside a container (and thus are allowed to modify any file), but you can't use that root access to affect files outside the container if you can't see the filesystem.

This is in fact false if you can abuse open_by_handle_at(2) to get access to a path you otherwise could not see in your mount namespace; this is the basis of one of the first container escapes, 'shocker.'

Others have already done an excellent job at explaining the exploit; these links should give more background if it is interesting to you:

This was ultimately solved in Docker by dropping capabilities by default, beyond a narrow list of innocuous ones (e.g. CAP_MKNOD).

Now we understand this family of system calls, and their risk to containers. However, we still haven't puzzled out why name_to_handle_at(2) was gated behind CAP_SYS_ADMIN; to begin understanding that, let's look at the evolution of the default seccomp profile:

  • The first version of the profile blocked open_by_handle_at(2) as it was implicated in shocker.
    • This appears to be something of a 'why not?' situation; the actual escape is solved by dropping the capability, but it doesn't hurt to filter with seccomp as well...
  • A subsequent change was to move to an allowlist instead of a denylist; the explicit reference to open_by_handle_at(2) was dropped.
  • A later, major refactor redesigned the default seccomp profile to work like it does today: --cap-add implies changes to the seccomp profile (and the profile is once again a denylist). At this time, both open_by_handle_at(2) and name_to_handle_at(2) were gated on CAP_DAC_READ_SEARCH.
    • This appears to be a case of 'guilty by association', i.e. both syscalls share a manpage, and one is behind a capability, so why not block them both?
    • Fun fact: the docs used to mistakenly assert this was gated by CAP_SYS_NICE.
  • Finally, name_to_handle_at(2) was moved to CAP_SYS_ADMIN because systemd required it, and did not actually require open_by_handle_at(2).
    • It appears that no one reexamined why this was behind a capability, and assumed that it 'must' be because it was in some way dangerous.

In summary: this was incorrectly moved from CAP_DAC_READ_SEARCH to CAP_SYS_ADMIN (a more restrictive capability) in order to make it easier to run systemd (since systemd implied --cap-add SYS_ADMIN, and it made the command line shorter). However, name_to_handle_at(2) is in fact innocuous and safe, and I assert it should not be filtered in the default seccomp profile at all.

I suggest altering this PR to remove it from the list of filtered syscalls, and creating a new PR to adjust the docs.

@neersighted
Copy link
Member

neersighted commented Jun 22, 2023

Also, after the above change is made and accepted, this PR will need backports to all branches (as I assert that this is a bug; though others may disagree. If we decide to backport, I can do this). In addition, we'll need to adjust the profile in containerd; however, I am happy to do that work.

@tianon
Copy link
Member

tianon commented Jun 23, 2023

Fantastic summary, and well done on the sleuthing, @neersighted 🙇 👍

@thaJeztah
Copy link
Member

@bartier do you need help / details on making the suggested changes? (I'm sure @neersighted is also happy to update this PR if needed)

@bartier
Copy link
Contributor Author

bartier commented Jun 26, 2023

Hi @neersighted , thanks for the depth explanation. Also, I have created the PR docker/docs#17623 to match the changes.

Please let me know if I am missing something but I am happy to help.

@neersighted
Copy link
Member

Thanks! The change overall looks good to me now; could we update the PR title and description for the new direction, and maybe rephrase the commit message to be a little more obvious (e.g. remove name_to_handle_at(2) from filtered syscalls)?

Otherwise, LGTM 👍

@bartier bartier changed the title fix disagreement between seccomp docs and profile reality remove name_to_handle_at(2) from filtered syscalls Jun 26, 2023
@bartier
Copy link
Contributor Author

bartier commented Jun 26, 2023

@neersighted I've squashed to 1 commit and changes the PR title. I wonder if it is required to squash the commits as we can merge and squash in the GitHub UI.

@bartier bartier force-pushed the 45518-fix-disagreement branch from 7dbce6c to 44cd940 Compare June 26, 2023 20:59
@corhere
Copy link
Contributor

corhere commented Jun 26, 2023

I wonder if it is required to squash the commits as we can merge and squash in the GitHub UI.

We have disabled the "Squash and merge" option on this repository. We want the merge commit to be separate from the PR commit(s), which doesn't happen when GitHub squash-merges a PR.

@neersighted
Copy link
Member

Still needs the commit message updated, but otherwise LGTM

@bartier bartier force-pushed the 45518-fix-disagreement branch from 44cd940 to fdc9b7c Compare June 27, 2023 12:50
@bartier
Copy link
Contributor Author

bartier commented Jun 27, 2023

@neersighted Changed to match the PR title

Copy link
Member

@neersighted neersighted left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for your patience!

@neersighted
Copy link
Member

neersighted added a commit to neersighted/containerd that referenced this pull request Jun 27, 2023
This syscall is used by systemd to request unique internal names for
paths in the cgroup hierarchy from the kernel, and is overall innocuous.

Due to [previous][1] [mistakes][2] in moby/moby, it ended up attached to
`CAP_SYS_ADMIN`; however, it should not be filtered at all.

An in-depth analysis is available [at moby/moby][3].

  [1]: moby/moby@a01c4dc#diff-6c0d906dbef148d2060ed71a7461907e5601fea78866e4183835c60e5d2ff01aR1627-R1639
  [2]: moby/moby@c1ca124
  [3]: moby/moby#45766 (review)

Co-authored-by: Vitor Anjos <[email protected]>
Signed-off-by: Bjorn Neergaard <[email protected]>
neersighted added a commit to neersighted/containerd that referenced this pull request Jun 27, 2023
This syscall is used by systemd to request unique internal names for
paths in the cgroup hierarchy from the kernel, and is overall innocuous.

Due to [previous][1] [mistakes][2] in moby/moby, it ended up attached to
`CAP_SYS_ADMIN`; however, it should not be filtered at all.

An in-depth analysis is available [at moby/moby][3].

  [1]: moby/moby@a01c4dc#diff-6c0d906dbef148d2060ed71a7461907e5601fea78866e4183835c60e5d2ff01aR1627-R1639
  [2]: moby/moby@c1ca124
  [3]: moby/moby#45766 (review)

Co-authored-by: Vitor Anjos <[email protected]>
Signed-off-by: Bjorn Neergaard <[email protected]>
(cherry picked from commit 73a35de)
Signed-off-by: Bjorn Neergaard <[email protected]>
neersighted added a commit to neersighted/containerd that referenced this pull request Jun 27, 2023
This syscall is used by systemd to request unique internal names for
paths in the cgroup hierarchy from the kernel, and is overall innocuous.

Due to [previous][1] [mistakes][2] in moby/moby, it ended up attached to
`CAP_SYS_ADMIN`; however, it should not be filtered at all.

An in-depth analysis is available [at moby/moby][3].

  [1]: moby/moby@a01c4dc#diff-6c0d906dbef148d2060ed71a7461907e5601fea78866e4183835c60e5d2ff01aR1627-R1639
  [2]: moby/moby@c1ca124
  [3]: moby/moby#45766 (review)

Co-authored-by: Vitor Anjos <[email protected]>
Signed-off-by: Bjorn Neergaard <[email protected]>
(cherry picked from commit 73a35de)
Signed-off-by: Bjorn Neergaard <[email protected]>
"mount",
"mount_setattr",
"move_mount",
"name_to_handle_at",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neersighted I noticed we remove this one from the list here, but don't add it to the allow-list,

Action: specs.ActAllow,

Doesn't that mean that it now gets the default (specs.ActErrno / SCMP_ACT_ERRNO)?

DefaultAction: specs.ActErrno,

neersighted added a commit to neersighted/moby that referenced this pull request Jun 28, 2023
Based on the analysis on [the previous PR][1].

  [1]: moby#45766 (review)

Signed-off-by: Bjorn Neergaard <[email protected]>
@neersighted neersighted added this to the v-future milestone Jun 28, 2023
neersighted added a commit to neersighted/moby that referenced this pull request Jun 28, 2023
Based on the analysis on [the previous PR][1].

  [1]: moby#45766 (review)

Signed-off-by: Bjorn Neergaard <[email protected]>
(cherry picked from commit b335e3d)
Signed-off-by: Bjorn Neergaard <[email protected]>
neersighted added a commit to neersighted/moby that referenced this pull request Jun 28, 2023
Based on the analysis on [the previous PR][1].

  [1]: moby#45766 (review)

Signed-off-by: Bjorn Neergaard <[email protected]>
(cherry picked from commit b335e3d)
Signed-off-by: Bjorn Neergaard <[email protected]>
neersighted added a commit to neersighted/moby that referenced this pull request Jun 28, 2023
Based on the analysis on [the previous PR][1].

  [1]: moby#45766 (review)

Signed-off-by: Bjorn Neergaard <[email protected]>
(cherry picked from commit b335e3d)
Resolved conflicts:
	profiles/seccomp/default_linux.go
Signed-off-by: Bjorn Neergaard <[email protected]>
neersighted added a commit to neersighted/containerd that referenced this pull request Jun 28, 2023
This syscall is used by systemd to request unique internal names for
paths in the cgroup hierarchy from the kernel, and is overall innocuous.

Due to [previous][1] [mistakes][2] in moby/moby, it ended up attached to
`CAP_SYS_ADMIN`; however, it should not be filtered at all.

An in-depth analysis is available [at moby/moby][3].

  [1]: moby/moby@a01c4dc#diff-6c0d906dbef148d2060ed71a7461907e5601fea78866e4183835c60e5d2ff01aR1627-R1639
  [2]: moby/moby@c1ca124
  [3]: moby/moby#45766 (review)

Co-authored-by: Vitor Anjos <[email protected]>
Signed-off-by: Bjorn Neergaard <[email protected]>
neersighted added a commit to neersighted/containerd that referenced this pull request Jun 28, 2023
This syscall is used by systemd to request unique internal names for
paths in the cgroup hierarchy from the kernel, and is overall innocuous.

Due to [previous][1] [mistakes][2] in moby/moby, it ended up attached to
`CAP_SYS_ADMIN`; however, it should not be filtered at all.

An in-depth analysis is available [at moby/moby][3].

  [1]: moby/moby@a01c4dc#diff-6c0d906dbef148d2060ed71a7461907e5601fea78866e4183835c60e5d2ff01aR1627-R1639
  [2]: moby/moby@c1ca124
  [3]: moby/moby#45766 (review)

Co-authored-by: Vitor Anjos <[email protected]>
Signed-off-by: Bjorn Neergaard <[email protected]>
(cherry picked from commit 9a202e3)
Signed-off-by: Bjorn Neergaard <[email protected]>
neersighted added a commit to neersighted/containerd that referenced this pull request Jun 28, 2023
This syscall is used by systemd to request unique internal names for
paths in the cgroup hierarchy from the kernel, and is overall innocuous.

Due to [previous][1] [mistakes][2] in moby/moby, it ended up attached to
`CAP_SYS_ADMIN`; however, it should not be filtered at all.

An in-depth analysis is available [at moby/moby][3].

  [1]: moby/moby@a01c4dc#diff-6c0d906dbef148d2060ed71a7461907e5601fea78866e4183835c60e5d2ff01aR1627-R1639
  [2]: moby/moby@c1ca124
  [3]: moby/moby#45766 (review)

Co-authored-by: Vitor Anjos <[email protected]>
Signed-off-by: Bjorn Neergaard <[email protected]>
(cherry picked from commit 9a202e3)
Signed-off-by: Bjorn Neergaard <[email protected]>
@neersighted neersighted modified the milestones: v-future, 25.0.0 Jun 29, 2023
jsturtevant pushed a commit to jsturtevant/containerd that referenced this pull request Sep 21, 2023
This syscall is used by systemd to request unique internal names for
paths in the cgroup hierarchy from the kernel, and is overall innocuous.

Due to [previous][1] [mistakes][2] in moby/moby, it ended up attached to
`CAP_SYS_ADMIN`; however, it should not be filtered at all.

An in-depth analysis is available [at moby/moby][3].

  [1]: moby/moby@a01c4dc#diff-6c0d906dbef148d2060ed71a7461907e5601fea78866e4183835c60e5d2ff01aR1627-R1639
  [2]: moby/moby@c1ca124
  [3]: moby/moby#45766 (review)

Co-authored-by: Vitor Anjos <[email protected]>
Signed-off-by: Bjorn Neergaard <[email protected]>
juliusl pushed a commit to juliusl/containerd that referenced this pull request Jan 26, 2024
This syscall is used by systemd to request unique internal names for
paths in the cgroup hierarchy from the kernel, and is overall innocuous.

Due to [previous][1] [mistakes][2] in moby/moby, it ended up attached to
`CAP_SYS_ADMIN`; however, it should not be filtered at all.

An in-depth analysis is available [at moby/moby][3].

  [1]: moby/moby@a01c4dc#diff-6c0d906dbef148d2060ed71a7461907e5601fea78866e4183835c60e5d2ff01aR1627-R1639
  [2]: moby/moby@c1ca124
  [3]: moby/moby#45766 (review)

Co-authored-by: Vitor Anjos <[email protected]>
Signed-off-by: Bjorn Neergaard <[email protected]>
thaJeztah pushed a commit to moby/profiles that referenced this pull request Jul 22, 2025
Based on the analysis on [the previous PR][1].

  [1]: moby/moby#45766 (review)

Signed-off-by: Bjorn Neergaard <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

name_to_handle_at is gated by CAP_SYS_ADMIN and not CAP_DAC_READ_SEARCH as documented

5 participants