-
Notifications
You must be signed in to change notification settings - Fork 5.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Docker Shared PID proposal for per-pod configuration #1048
Conversation
@verb: Your pull request title starts with "WIP", so the do-not-merge/work-in-progress label will be added. This label will ensure that your pull request will not be merged. Remove the prefix from your pull request title to trigger the removal of the label and allow for your pull request to be merged. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/sig node |
@kubernetes/sig-node-proposals |
@@ -60,18 +95,11 @@ The ability to disable shared PID namespaces is intended as a way to roll back | |||
to prior behavior in the event of unforeseen problems. It won't be possible to | |||
configure the behavior per-pod. We believe this is acceptable because: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should probably change this section.
but choosing not to support isolated PID namespaces indefinitely would influence | ||
design choices. In this scenario we would avoid modifying the CRI or | ||
implementing isolated namespaces in other runtimes. Enabling the shared | ||
namespace in v1.Pod could happen in a new field or in an annotation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am having problems parsing the two options. What's the difference? Is it just the default?
Please stand by for a full rewrite based on our discussion in sig node.
… |
07dbdc9
to
d765a80
Compare
@smarterclayton @yujuhong @derekwaynecarr This is now updated, internally consistent (I hope) and ready for review. It includes what we discussed previously in the SIG node meetup, which was adding a configuration option to both the Pod & CRI APIs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About time! :)
* We have not identified a concrete use case requiring isolated PID namespaces. | ||
* Making PID namespace configurable requires changing the CRI, which we would | ||
like to avoid since there are no use cases. | ||
* We have not identified a concrete use case requiring isolated PID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about "I use systemd in my container and systemd pukes when it is not pid 1" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thockin I'm not sure how this happened, but some of your comments are for an outdated revision and some are current. I suspect you started a review and I updated the doc in the middle of it. Sorry about that.
The latest version of the proposal dropped this section and lists supporting pid == 1 in the goals.
per-pod configurable PID namespaces, and the choice hinges mainly on long-term | ||
plans for PID namespace sharing. | ||
|
||
### Option 1: Configurable Namespaces |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yujuhong What do you think here?
|
||
### Kubernetes API Changes | ||
|
||
`v1.PodSecurityContext` gains a new field resembling the existing `hostPID` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? is shared PID somewhow a security escalation that we might want to deny users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the discussion in kubernetes/kubernetes#48937 and kubernetes/kubernetes#1615, people want to be able to use PID namespace separation as a security feature.
@mrunalp - please review as well. |
The Container Runtime Interface (CRI) will be updated to explicitly support | ||
shared PID namespaces. When `sharedPID` is true, runtimes MUST share a PID | ||
namespace for the pod or return an error. When `sharedPID` is false, runtimes | ||
MAY implement isolated PID namespaces. That is, omitting this option defaults to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering why this is left as a MAY. Can't it be MUST?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for MUST.
I'd rather see the behavior clearly defined in CRI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAY is the status quo of no explicit default. I left it MAY for two reasons:
- Do isolated PID namespaces exist in all runtimes? What do hypervisor runtimes do? What about rkt? I'm not sure since I only use Docker.
- Do any current runtimes default to shared PID? Do we care about being backwards compatible with them? If yes, then we can't retroactively define a default behavior for v1.Pod, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yujuhong What do you think about an enum for the CRI? Something like: verb/kubernetes@db76a1b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rewritten this section to clearly define the behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yujuhong What do you think about an enum for the CRI? Something like: verb/kubernetes@db76a1b
Personally I like that better than a set of booleans for toggling the behavior :-) We could name the enum
NamespaceType to be more generic.
That is, however, an API breaking change for CRI.
@mrunalp @Random-Liu @feiskyer thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, great, a default of isolated it is. Good point on the generic name.
Also I realized that we'll need to be able to target a container's namespace for debug containers, which might influence the API design. This could be done as a fourth NamespaceType: verb/kubernetes@48fe4c2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@feiskyer for frakti? @Random-Liu for containerd @mrunalp for cri-o?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At sig-node, we talked about this, and the decision is introducing enum. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yujuhong Enum does look cleaner to me. And we could easily change to use the new enum in cri-containerd.
shared PID namespaces. When `sharedPID` is true, runtimes MUST share a PID | ||
namespace for the pod or return an error. When `sharedPID` is false, runtimes | ||
MAY implement isolated PID namespaces. That is, omitting this option defaults to | ||
pre-existing runtime behavior. For Docker the default remains isolated PID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the default is not really runtime specific. we need the api default to remain isolated across any runtime at this moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I thought the rkt & hypervisor runtimes defaulted to shared PID so making isolated the default could break existing config. I'm happy to define the default as isolated, I was just trying to be careful about backwards compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rewritten this section to clearly define the behavior.
to true in `LinuxContainerSecurityContext` to a call to Docker's create | ||
container request with `PidMode` set to the container ID of the sandbox | ||
container. All containers in the pod will then start in the same PID namespace | ||
as the sandbox container, with the `/pause` binary as PID 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to allow users to use their own custom sandbox image with pid 1 capabilities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's nothing preventing users to configure their runtime/shim to use a different image, if the runtime/shim supports it. I don't see we support that in kube/CRI in the near future though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be great to support such a feature, but I think it would be complicated. That's more towards the "general purpose init" to which I alluded in Non-goals.
The Container Runtime Interface (CRI) will be updated to explicitly support | ||
shared PID namespaces. When `sharedPID` is true, runtimes MUST share a PID | ||
namespace for the pod or return an error. When `sharedPID` is false, runtimes | ||
MAY implement isolated PID namespaces. That is, omitting this option defaults to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for MUST.
I'd rather see the behavior clearly defined in CRI.
We will add first class support for configuring isolated and shared PID | ||
namespaces for pods by adding a new boolean field `sharedPID` to the pod spec. | ||
When set to true, all containers in the pod will share a single PID namespace | ||
when supported by runtime. When not supported by the runtime, for example with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The runtime behavior should be move to the next paragraph, together with CRI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* Making PID namespace configurable requires changing the CRI, which we would | ||
like to avoid since there are no use cases. | ||
We will add first class support for configuring isolated and shared PID | ||
namespaces for pods by adding a new boolean field `sharedPID` to the pod spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarify that default will be False.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
... | ||
``` | ||
|
||
`PodSecurityContext` was chosen because it's where namespace configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the internal API object..... The v1.PodSecurityContext does not contain the PID namespace fields...Could you update the text/code here to reflect that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, you're right. Updated.
to true in `LinuxContainerSecurityContext` to a call to Docker's create | ||
container request with `PidMode` set to the container ID of the sandbox | ||
container. All containers in the pod will then start in the same PID namespace | ||
as the sandbox container, with the `/pause` binary as PID 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's nothing preventing users to configure their runtime/shim to use a different image, if the runtime/shim supports it. I don't see we support that in kube/CRI in the near future though.
|
||
Setting both `SharedPID` and `HostPID` will cause a validation error. | ||
|
||
### Container Runtime Interface Changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc/ @Random-Liu for containerd and @mrunalp for cri-o on CRI changes.
2. Modify the Infra container used by the Docker runtime to reap orphaned | ||
zombies ([#36853](https://pr.k8s.io/36853)). | ||
We will add first class support for configuring isolated and shared PID | ||
namespaces for pods by adding a new boolean field `sharedPID` to the pod spec, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name leaves something to be desired, even if it is a mirror of hostPID
(which itself is not a great name).
I'd probably prefer a name that covers the two important aspects of the flag:
- that none of the containers will be PID 1
- that all processes in each container will be able to see each other (and potentially interact)
shareProcessNamespace
or something that captures "all containers will be part of the shared namespace" is better than sharedPID
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. I like the more descriptive name. We'll need to be sure to document well the relationship between shareProcessNamespace
and hostPID
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smarterclayton could you suggest an API reviewer for this change? It'd be nice to loop in that person to the proposal rather than waiting for one to be auto-assigned by the code PR.
1. easier troubleshooting of pods. | ||
1. addressing [Docker's zombie | ||
problem](https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/) | ||
by reaping orphaned zombies in the infra container. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note the downsides of enabling this shared mode - sidecar containers that were previously isolated are no longer so, environment variables are now visible to all other processes, any "kill all" semantics used within the process are now broken, exec processes from other containers will now show up, etc. This doc should clarify tradeoffs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a section here. Let me know if I missed anything.
// PID namespace for this container/sandbox. | ||
// Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER. | ||
// The kubelet's runtime manager will set this to CONTAINER by default for v1 pods. | ||
// Runtimes must support: POD, CONTAINER, NODE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this support TARGET too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up question: do we want other namespaces to also support TARGET?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit strange for the other namespaces since they don't support isolation. We don't need to support TARGET for things like network & ipc because POD & NODE are sufficient for the new container to be placed in the correct namespace.
// IPC namespace for this container/sandbox. | ||
// Runtimes must support: POD, NODE | ||
NamespaceMode ipc = 3; | ||
// Target Container or Sandbox ID for NamespaceMode of TARGET |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to allow Sandbox IDs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, no, and I don't think we want to. Containers should only be able to target their own pod. I'll remove that.
Since we will support isolated PID namespaces indefinitely, we will need to be | ||
able to target the namespace of specific isolated containers to enable | ||
[Troubleshooting Running Pods](troubleshooting-running-pods.md). This is done | ||
with an additional field in `NamespaceOption`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the previous subsection is more straightforward, but the introduction of TARGET makes the use case more complex. Could you elaborate a bit more on how the debug container would be configured in CRI?
In CRI today, you'd need to create a container in a PodSandbox, i.e., your debug container needs to join an existing PodSandbox, or you'd need to create a new PodSandbox for it. Are you going to allow pod A to join some of pod B's namespaces, or do you always require the debug container to join the PodSandbox in which the target container resides? What combination of NamespaceOptions and PodSandbox configs are valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Containers should not be able to join a namespace from a different pod. We require the debug container to join the sandbox in which the target container resides. I've updated the propose to be explicit about this.
|
||
### Container Runtime Interface Changes | ||
|
||
Namespace options in the CRI are currently specified for both `PodSandbox` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CRI changes in this subsection looks good to me. It makes the API more clear.
/cc a few CRI runtime maintainers since the change won't be backward compatible.
@Random-Liu @mrunalp @runcom
We should be able to support this in CRI-O. |
To get the infra-reaping benefits of pod mode with the isolation benefits of container-mode, I'd like to also specify a mode where the pod's worker containers have PID namespaces which are children of the infra container's PID namespace. So:
That's the |
@wking Interesting, what are the benefits of "pod-container" over "container"? Is it that with pod-container the infra container will inherit orphaned zombies somehow? |
Assuming there are no subreaper modifications, here's an example process tree with
And here's the same setup with
And here's the same setup with
But I think the real benefit is that |
I think this falls under the purview of the runtime given that we don't restrict how podsandbox must be implemented in CRI. |
/lgtm |
On Tue, Jan 23, 2018 at 10:30:50PM +0000, Yu-Ju Hong wrote:
> That's the pod-container choice in the implementation I've floated
> in cri-o/cri-o#1280.
I think this falls under the purview of the runtime given that we
don't restrict how podsandbox must be implemented in CRI.
Are you saying that both my ‘container’ and ‘pod-container’ approaches
would be valid implementations of this proposal's ‘container’ choice,
and that there would be no standard way for a user to select between
them?
|
@wking yes. As of now, CRI does not dictate how runtime implements the podsandbox. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dchen1107, verb, yujuhong The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
Update Docker Shared PID proposal for per-pod configuration
* Simplfy org configuration * Give members back triage
We will need to support configuring shared PID namespace in docker per-pod.