Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Docker Shared PID proposal for per-pod configuration #1048

Merged
merged 1 commit into from
Jan 24, 2018

Conversation

verb
Copy link
Contributor

@verb verb commented Sep 12, 2017

We will need to support configuring shared PID namespace in docker per-pod.

@k8s-ci-robot
Copy link
Contributor

@verb: Your pull request title starts with "WIP", so the do-not-merge/work-in-progress label will be added.

This label will ensure that your pull request will not be merged. Remove the prefix from your pull request title to trigger the removal of the label and allow for your pull request to be merged.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 12, 2017
@verb
Copy link
Contributor Author

verb commented Sep 12, 2017

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Sep 12, 2017
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2017
@thockin
Copy link
Member

thockin commented Sep 15, 2017

@kubernetes/sig-node-proposals

@@ -60,18 +95,11 @@ The ability to disable shared PID namespaces is intended as a way to roll back
to prior behavior in the event of unforeseen problems. It won't be possible to
configure the behavior per-pod. We believe this is acceptable because:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably change this section.

but choosing not to support isolated PID namespaces indefinitely would influence
design choices. In this scenario we would avoid modifying the CRI or
implementing isolated namespaces in other runtimes. Enabling the shared
namespace in v1.Pod could happen in a new field or in an annotation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having problems parsing the two options. What's the difference? Is it just the default?

@verb
Copy link
Contributor Author

verb commented Sep 22, 2017 via email

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 22, 2017
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 22, 2017
@verb verb force-pushed the per-pod branch 2 times, most recently from 07dbdc9 to d765a80 Compare September 22, 2017 14:29
@verb verb changed the title WIP: Update Docker Shared PID proposal for per-pod configuration Update Docker Shared PID proposal for per-pod configuration Sep 22, 2017
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 22, 2017
@verb
Copy link
Contributor Author

verb commented Sep 22, 2017

@smarterclayton @yujuhong @derekwaynecarr This is now updated, internally consistent (I hope) and ready for review. It includes what we discussed previously in the SIG node meetup, which was adding a configuration option to both the Pod & CRI APIs.

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About time! :)

* We have not identified a concrete use case requiring isolated PID namespaces.
* Making PID namespace configurable requires changing the CRI, which we would
like to avoid since there are no use cases.
* We have not identified a concrete use case requiring isolated PID
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about "I use systemd in my container and systemd pukes when it is not pid 1" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thockin I'm not sure how this happened, but some of your comments are for an outdated revision and some are current. I suspect you started a review and I updated the doc in the middle of it. Sorry about that.

The latest version of the proposal dropped this section and lists supporting pid == 1 in the goals.

per-pod configurable PID namespaces, and the choice hinges mainly on long-term
plans for PID namespace sharing.

### Option 1: Configurable Namespaces
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yujuhong What do you think here?


### Kubernetes API Changes

`v1.PodSecurityContext` gains a new field resembling the existing `hostPID`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? is shared PID somewhow a security escalation that we might want to deny users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the discussion in kubernetes/kubernetes#48937 and kubernetes/kubernetes#1615, people want to be able to use PID namespace separation as a security feature.

@derekwaynecarr
Copy link
Member

@mrunalp - please review as well.

@derekwaynecarr derekwaynecarr self-assigned this Sep 25, 2017
The Container Runtime Interface (CRI) will be updated to explicitly support
shared PID namespaces. When `sharedPID` is true, runtimes MUST share a PID
namespace for the pod or return an error. When `sharedPID` is false, runtimes
MAY implement isolated PID namespaces. That is, omitting this option defaults to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering why this is left as a MAY. Can't it be MUST?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for MUST.
I'd rather see the behavior clearly defined in CRI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAY is the status quo of no explicit default. I left it MAY for two reasons:

  1. Do isolated PID namespaces exist in all runtimes? What do hypervisor runtimes do? What about rkt? I'm not sure since I only use Docker.
  2. Do any current runtimes default to shared PID? Do we care about being backwards compatible with them? If yes, then we can't retroactively define a default behavior for v1.Pod, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yujuhong What do you think about an enum for the CRI? Something like: verb/kubernetes@db76a1b

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rewritten this section to clearly define the behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yujuhong What do you think about an enum for the CRI? Something like: verb/kubernetes@db76a1b

Personally I like that better than a set of booleans for toggling the behavior :-) We could name the enum NamespaceType to be more generic.
That is, however, an API breaking change for CRI.

@mrunalp @Random-Liu @feiskyer thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, great, a default of isolated it is. Good point on the generic name.

Also I realized that we'll need to be able to target a container's namespace for debug containers, which might influence the API design. This could be done as a fourth NamespaceType: verb/kubernetes@48fe4c2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@feiskyer for frakti? @Random-Liu for containerd @mrunalp for cri-o?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At sig-node, we talked about this, and the decision is introducing enum. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yujuhong Enum does look cleaner to me. And we could easily change to use the new enum in cri-containerd.

shared PID namespaces. When `sharedPID` is true, runtimes MUST share a PID
namespace for the pod or return an error. When `sharedPID` is false, runtimes
MAY implement isolated PID namespaces. That is, omitting this option defaults to
pre-existing runtime behavior. For Docker the default remains isolated PID
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the default is not really runtime specific. we need the api default to remain isolated across any runtime at this moment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I thought the rkt & hypervisor runtimes defaulted to shared PID so making isolated the default could break existing config. I'm happy to define the default as isolated, I was just trying to be careful about backwards compatibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rewritten this section to clearly define the behavior.

to true in `LinuxContainerSecurityContext` to a call to Docker's create
container request with `PidMode` set to the container ID of the sandbox
container. All containers in the pod will then start in the same PID namespace
as the sandbox container, with the `/pause` binary as PID 1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to allow users to use their own custom sandbox image with pid 1 capabilities?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing preventing users to configure their runtime/shim to use a different image, if the runtime/shim supports it. I don't see we support that in kube/CRI in the near future though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be great to support such a feature, but I think it would be complicated. That's more towards the "general purpose init" to which I alluded in Non-goals.

The Container Runtime Interface (CRI) will be updated to explicitly support
shared PID namespaces. When `sharedPID` is true, runtimes MUST share a PID
namespace for the pod or return an error. When `sharedPID` is false, runtimes
MAY implement isolated PID namespaces. That is, omitting this option defaults to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for MUST.
I'd rather see the behavior clearly defined in CRI.

We will add first class support for configuring isolated and shared PID
namespaces for pods by adding a new boolean field `sharedPID` to the pod spec.
When set to true, all containers in the pod will share a single PID namespace
when supported by runtime. When not supported by the runtime, for example with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runtime behavior should be move to the next paragraph, together with CRI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

* Making PID namespace configurable requires changing the CRI, which we would
like to avoid since there are no use cases.
We will add first class support for configuring isolated and shared PID
namespaces for pods by adding a new boolean field `sharedPID` to the pod spec.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify that default will be False.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

...
```

`PodSecurityContext` was chosen because it's where namespace configuration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the internal API object..... The v1.PodSecurityContext does not contain the PID namespace fields...Could you update the text/code here to reflect that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, you're right. Updated.

to true in `LinuxContainerSecurityContext` to a call to Docker's create
container request with `PidMode` set to the container ID of the sandbox
container. All containers in the pod will then start in the same PID namespace
as the sandbox container, with the `/pause` binary as PID 1.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing preventing users to configure their runtime/shim to use a different image, if the runtime/shim supports it. I don't see we support that in kube/CRI in the near future though.


Setting both `SharedPID` and `HostPID` will cause a validation error.

### Container Runtime Interface Changes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc/ @Random-Liu for containerd and @mrunalp for cri-o on CRI changes.

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 22, 2017
2. Modify the Infra container used by the Docker runtime to reap orphaned
zombies ([#36853](https://pr.k8s.io/36853)).
We will add first class support for configuring isolated and shared PID
namespaces for pods by adding a new boolean field `sharedPID` to the pod spec,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name leaves something to be desired, even if it is a mirror of hostPID (which itself is not a great name).

I'd probably prefer a name that covers the two important aspects of the flag:

  1. that none of the containers will be PID 1
  2. that all processes in each container will be able to see each other (and potentially interact)

shareProcessNamespace or something that captures "all containers will be part of the shared namespace" is better than sharedPID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM. I like the more descriptive name. We'll need to be sure to document well the relationship between shareProcessNamespace and hostPID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smarterclayton could you suggest an API reviewer for this change? It'd be nice to loop in that person to the proposal rather than waiting for one to be auto-assigned by the code PR.

1. easier troubleshooting of pods.
1. addressing [Docker's zombie
problem](https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/)
by reaping orphaned zombies in the infra container.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note the downsides of enabling this shared mode - sidecar containers that were previously isolated are no longer so, environment variables are now visible to all other processes, any "kill all" semantics used within the process are now broken, exec processes from other containers will now show up, etc. This doc should clarify tradeoffs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a section here. Let me know if I missed anything.

@smarterclayton
Copy link
Contributor

@mrunalp @runcom can you please comment on this with respect to CRI-O? I'd like to see forward progress on this in 1.10.

// PID namespace for this container/sandbox.
// Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
// The kubelet's runtime manager will set this to CONTAINER by default for v1 pods.
// Runtimes must support: POD, CONTAINER, NODE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this support TARGET too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up question: do we want other namespaces to also support TARGET?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit strange for the other namespaces since they don't support isolation. We don't need to support TARGET for things like network & ipc because POD & NODE are sufficient for the new container to be placed in the correct namespace.

// IPC namespace for this container/sandbox.
// Runtimes must support: POD, NODE
NamespaceMode ipc = 3;
// Target Container or Sandbox ID for NamespaceMode of TARGET
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to allow Sandbox IDs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, no, and I don't think we want to. Containers should only be able to target their own pod. I'll remove that.

Since we will support isolated PID namespaces indefinitely, we will need to be
able to target the namespace of specific isolated containers to enable
[Troubleshooting Running Pods](troubleshooting-running-pods.md). This is done
with an additional field in `NamespaceOption`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous subsection is more straightforward, but the introduction of TARGET makes the use case more complex. Could you elaborate a bit more on how the debug container would be configured in CRI?

In CRI today, you'd need to create a container in a PodSandbox, i.e., your debug container needs to join an existing PodSandbox, or you'd need to create a new PodSandbox for it. Are you going to allow pod A to join some of pod B's namespaces, or do you always require the debug container to join the PodSandbox in which the target container resides? What combination of NamespaceOptions and PodSandbox configs are valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Containers should not be able to join a namespace from a different pod. We require the debug container to join the sandbox in which the target container resides. I've updated the propose to be explicit about this.


### Container Runtime Interface Changes

Namespace options in the CRI are currently specified for both `PodSandbox` and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CRI changes in this subsection looks good to me. It makes the API more clear.

/cc a few CRI runtime maintainers since the change won't be backward compatible.
@Random-Liu @mrunalp @runcom

@mrunalp
Copy link
Contributor

mrunalp commented Jan 9, 2018

We should be able to support this in CRI-O.

@wking
Copy link
Contributor

wking commented Jan 23, 2018

To get the infra-reaping benefits of pod mode with the isolation benefits of container-mode, I'd like to also specify a mode where the pod's worker containers have PID namespaces which are children of the infra container's PID namespace. So:

  • host PID namespace
    • infra PID namespace
      • worker PID namespace
      • worker PID namespace

That's the pod-container choice in the implementation I've floated in cri-o/cri-o#1280.

@verb
Copy link
Contributor Author

verb commented Jan 23, 2018

@wking Interesting, what are the benefits of "pod-container" over "container"? Is it that with pod-container the infra container will inherit orphaned zombies somehow?

@wking
Copy link
Contributor

wking commented Jan 23, 2018

Interesting, what are the benefits of "pod-container" over "container"? Is it that with pod-container the infra container will inherit orphaned zombies somehow?

Assuming there are no subreaper modifications, here's an example process tree with pod-container:

PID (host) PID (infra) PID (container) Reaper (host namespace)
1 - - -
2 1 - 1
3 2 1 2
4 3 2 3

And here's the same setup with container:

PID (host) PID (infra) PID (container) Reaper (host namespace)
1 - - -
2 1 - 1
3 - 1 1
4 - 2 3

And here's the same setup with pod:

PID (host) PID (infra) Reaper (host namespace)
1 - -
2 1 1
3 2 2
4 3 2

But I think the real benefit is that pod-container makes it easy for the infra process(es) to monitor processes in the worker containers (like the pod approach) while making it hard for those worker processes to mess with the infra processes (like the container approach).

@yujuhong
Copy link
Contributor

That's the pod-container choice in the implementation I've floated in cri-o/cri-o#1280.

I think this falls under the purview of the runtime given that we don't restrict how podsandbox must be implemented in CRI.

@yujuhong
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 23, 2018
@wking
Copy link
Contributor

wking commented Jan 23, 2018 via email

@yujuhong
Copy link
Contributor

@wking yes. As of now, CRI does not dictate how runtime implements the podsandbox.

@dchen1107
Copy link
Member

/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, verb, yujuhong

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2018
@k8s-ci-robot k8s-ci-robot merged commit 483f393 into kubernetes:master Jan 24, 2018
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Update Docker Shared PID proposal for per-pod configuration
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023
* Simplfy org configuration

* Give members back triage
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.