
[RuntimeClass scheduling] native scheduler support, ready to implement #909

Merged
merged 2 commits into kubernetes:master from the runtimeclass-scheduling branch on Apr 4, 2019

Conversation

tallclair
Member

Updated the RuntimeClass scheduling KEP to use native scheduler support rather than a mixin approach.

This approach uses a cleaner API (it doesn't need to mix a bunch of rules into the pod's node affinity), it lets us use the more expressive NodeSelector API, and it provides a nice user experience in the error case.

I am opening this PR to continue the discussion @bsalamat started here

/sig node scheduling
/assign @bsalamat @derekwaynecarr @dchen1107
/cc @yastij
/milestone v1.15
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added this to the v1.15 milestone Mar 20, 2019
@k8s-ci-robot k8s-ci-robot requested a review from yastij March 20, 2019 22:48
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Mar 20, 2019
scheduling is to merge the topology constraints from the RuntimeClass into the
PodSpec. Eventually, the controller's responsibilities may grow, such as to
merge in [pod overhead][] or validate feature compatibility.
A new scheduler predicate will manage the RuntimeClass topology. It will lookup
Member

Maybe we should discuss predicate composition vs. adding RuntimeClass support to the existing predicates.

Member Author

If we add it to existing predicates, we don't get a useful error message when the pod can't be scheduled, right?

Member

Yes

Member

@tallclair - if we merge tolerations into the podSpec, how can we correctly inform the user (UX) when no node fits?

Member Author

I don't think that changes anything, as the tolerations would be resolved by the same PodToleratesNodeTaints predicate? But I might be misunderstanding how the debug messages are created.

Member

Do you mean that we’d still check the runtimeClass when computing podToleratesNodeTaints?

Member

I think we can move forward, and discuss this while implementing

#### Mix-in

The RuntimeClass topology is merged with the pod's NodeSelector & Tolerations.
The `PodToleratesNodeTaints` predicate will also be extended to be
Member

Can't we just call PodToleratesNodeTaints from the RuntimeClassPredicate with the right parameters? Composing seems easier than modifying each predicate to include RuntimeClass awareness.

Member Author

I assume predicates are AND'd together, in which case the PodToleratesNodeTaints predicate would also be called without the runtimeclass tolerations, and in that instance fail? I'm not very familiar with the scheduler code though, so I could be totally wrong.

Member

This raises a broader issue at the predicate level, when a predicate is basically a composition of others:

  • do we add domain awareness to the existing predicates?
  • do we compose these predicates into a domain predicate?

This might overcomplicate our predicate definitions, and create performance overhead in some cases.
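
For reference, a rough sketch of the composition idea under discussion: a standalone RuntimeClass predicate that enforces the class's node selector and reuses the shared taint-toleration check with the merged toleration list, instead of making PodToleratesNodeTaints RuntimeClass-aware. The types and signatures below are simplified stand-ins, not the actual scheduler predicate API:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// runtimeClassScheduling is a stand-in for the scheduling fields this KEP proposes
// to add to RuntimeClass (field shapes are illustrative, not the final API).
type runtimeClassScheduling struct {
	NodeSelector map[string]string // simplified; the KEP discusses richer NodeSelector forms
	Tolerations  []corev1.Toleration
}

// runtimeClassFits sketches a dedicated predicate that composes the existing
// taint-toleration check instead of modifying it.
func runtimeClassFits(
	pod *corev1.Pod,
	node *corev1.Node,
	sched *runtimeClassScheduling, // already resolved from pod.Spec.RuntimeClassName
	toleratesTaints func(taints []corev1.Taint, tolerations []corev1.Toleration) bool,
) (fits bool, reason string) {
	if sched == nil {
		// Pod has no RuntimeClass, or the class declares no scheduling constraints.
		return true, ""
	}
	// 1. Enforce the RuntimeClass node selector against the node's labels.
	for k, v := range sched.NodeSelector {
		if node.Labels[k] != v {
			return false, "node doesn't match RuntimeClass node selector"
		}
	}
	// 2. Composition: reuse the shared taint-toleration logic, passing the pod's
	//    tolerations with the RuntimeClass tolerations appended.
	merged := append(append([]corev1.Toleration{}, pod.Spec.Tolerations...), sched.Tolerations...)
	if !toleratesTaints(node.Spec.Taints, merged) {
		return false, "node's taints are not tolerated by the pod + RuntimeClass"
	}
	return true, ""
}
```

As noted above, if predicates are simply AND'd together, the stock PodToleratesNodeTaints predicate would still run without the RuntimeClass tolerations, which is the crux of the composition question.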

@tallclair
Member Author

Summarizing feedback from 2019-03-26 sig-node:

  • Add a section about node label restrictions, and how those should play into the RuntimeClass topology (security considerations)
  • Additional tradeoff on native scheduling vs. mixin: with native scheduling, it's harder to apply scheduling policy. This seems like more of a concern with tolerations than with the node selector, so I think we should actually take a hybrid approach, where tolerations are mixed in but the node selector is applied at scheduling time.

/assign @egernst

@yastij
Member

yastij commented Mar 26, 2019

@tallclair - I think this has 2 main benefits

  • This tradeoff aligns with the scheduling policy, especially since enforcement of affinities will be done at scheduling time.

  • This will avoid the issue of predicate composition (If we still want to go down this path, I think we need a KEP for predicate composition)

My only concern is on the UX side, as @bsalamat stated.

// +optional
NodeSelectorTerm []corev1.NodeSelectorRequirement
NodeSelector corev1.NodeSelector

// tolerations adds tolerations to pods running with this RuntimeClass.
Member

I am a bit unsure about having tolerations as a part of "topology" in RuntimeClass. This is useful if one doesn't want to provide RuntimeClass on all their pods. If they provide RuntimeClass for all their pods they can just use NodeSelector. If they insist on tainting some nodes, they can use regular Toleration (which is not a part of RuntimeClass). So, I guess my question is "what is the major benefit of having tolerations" here?

Member Author

I think this is covered under the Tolerations discussion on line 170:

Tolerations

While NodeSelectors and labels are used for steering pods towards nodes,
[taints and tolerations][taint-and-toleration] are used for steering pods away
from nodes. If every pod had a RuntimeClass and every RuntimeClass had a strict
NodeSelector, then RuntimeClasses could use non-overlapping selectors in place
of taints & tolerations. However the same could be said of regular pod
selectors, yet taints & tolerations are still a useful addition. Examples of
use cases for including tolerations in RuntimeClass topology
include:

  • Tainting Windows nodes with windows.microsoft.com to keep default linux pods
    away from the nodes. Windows RuntimeClasses would then need a corresponding
    toleration.
  • Tainting "sandbox" nodes with sandboxed.kubernetes.io to keep services
    providing privileged node features away from sandboxed workloads. Sandboxed
    RuntimeClasses would need a toleration to enable them to be run on those
    nodes.

I agree that it's not strictly necessary, but I think it will be useful in practice. I guess both of the use cases listed could be accomplished with a default RuntimeClass that looked like this:

kind: RuntimeClass
metadata:
  name: default
handler: runc
nodeSelector:
  nodeSelectorTerms:
    - matchExpressions:
        - key: "windows.microsoft.com"
          operator: DoesNotExist
        - key: "sandboxed.kubernetes.io"
          operator: DoesNotExist

Member

Does this assume that the toleration key is also set as a label? I think that isn't always the case.

Member Author

This implies using labels rather than taints.

Member

@yastij yastij Mar 27, 2019

It is more straightforward for users to implement dedicated nodes using taints and tolerations rather than nodeSelectors.

@bsalamat - The benefit of having tolerations on the RuntimeClass is that it frees users who want to consume a specific RuntimeClass available on dedicated nodes from having to think about which tolerations they should set.

Member

Your points are valid. Thanks for clarifying.
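
To make the dedicated-node pattern concrete, here is a minimal sketch using the sandboxed.kubernetes.io taint from the use case above (the key and effect are illustrative, not prescribed by the KEP):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// Taint applied to dedicated sandbox nodes to keep ordinary pods away
// (key/effect taken from the example use case above; illustrative only).
var sandboxTaint = corev1.Taint{
	Key:    "sandboxed.kubernetes.io",
	Effect: corev1.TaintEffectNoSchedule,
}

// Toleration the sandboxed RuntimeClass would contribute, so pods using that
// class can land on the tainted nodes without declaring the toleration themselves.
var sandboxToleration = corev1.Toleration{
	Key:      "sandboxed.kubernetes.io",
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectNoSchedule,
}
```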

@tallclair
Member Author

Integrated feedback from sig-node. In particular, the RuntimeController was added back, and tolerations are back to being a mixin.

@@ -203,6 +202,36 @@ will be a clear explanation of why. For example:
0/10 Nodes are available: 5 nodes do not have enough memory, 5 nodes don't match RuntimeClass
```

### RuntimeController
Member

nit: I would call this Runtime Admission Controller to clarify that this is not a controller with a control loop that runs all the time and monitors things in the cluster.

Member

I agree

Member Author

Looks like the convention of the admission plugins is to just omit the controller piece altogether - so this would just be "Runtime". I'll clarify the name in the KEP though.

Member

or rename to RuntimeTopology?

Member Author

The plan is to reuse the controller for other RuntimeClass-related admission controls. In particular PodOverhead. So I don't think it makes sense to include topology in the name.

Member

agree with a single admission plugin for a common feature.


@tallclair - have a suggested controller name in mind? ContainerRuntime? Runtime is pretty overloaded...

Member Author

I went with RuntimeClass for the controller name. Pretty self-explanatory.

@yastij
Member

yastij commented Apr 1, 2019

/lgtm
/hold

Just a rename or clarification of the admission controller's name is missing.

@bsalamat - any thoughts on merging tolerations into the podSpec? This would help enforce schedulingPolicy.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Apr 1, 2019
@bsalamat
Member

bsalamat commented Apr 1, 2019

@bsalamat - any thoughts on merging tolerations into the podSpec? This would help enforce schedulingPolicy.

Are you saying that instead of having "toleration" in the RuntimeClass, we use the PodSpec's own tolerations? If so, we discussed it in this KEP and decided that having tolerations in the RuntimeClass is useful.

@tallclair
Member Author

@bsalamat I think @yastij was asking about your thoughts on merging the tolerations FROM the RuntimeClass into the PodSpec during admission (what this KEP currently proposes), as opposed to just merging the tolerations at scheduling time (what it originally proposed).

@bsalamat
Member

bsalamat commented Apr 1, 2019

@tallclair I'd prefer it to be separate so that we can provide better "unschedulable" messages to users.

@yastij
Member

yastij commented Apr 1, 2019

@bsalamat - I agree

@tallclair - if we merge tolerations into the podSpec, how can we correctly inform the user (UX) when no node fits?

I think we can keep them separate for now. If we see that we need to add tolerations to the podSpec we can do it afterwards.

@derekwaynecarr
Member

@tallclair reached out and asked that I review. I am fine with what is presented.

@tallclair
Member Author

I spoke with @bsalamat offline, and we agreed that it makes sense to merge the tolerations into the pod spec, but the RuntimeClass.Topology.NodeSelector will still be handled in a new scheduler predicate. There isn't a UX benefit to handling the RuntimeClass tolerations in the scheduler, because they still need to be handled in the same PodToleratesNodeTaints predicate, meaning the error message is still the same.

If the user experience is the same, merging the tolerations is much simpler (for the reasons outlined above, around handling RuntimeClass deletion, etc.), and it also makes it easier to apply toleration restrictions through policies (e.g. SchedulingPolicy, gatekeeper).
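
For concreteness, a minimal sketch of what the admission-time toleration merge could look like; mergeRuntimeClassTolerations is a hypothetical helper, not the actual plugin code, and the real plugin would also resolve the RuntimeClass and handle lookup failures:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// mergeRuntimeClassTolerations appends the RuntimeClass's tolerations to the pod
// spec during admission, skipping any toleration the pod already declares.
// The dedup check here is a simplified field comparison.
func mergeRuntimeClassTolerations(pod *corev1.Pod, classTolerations []corev1.Toleration) {
	for _, t := range classTolerations {
		duplicate := false
		for _, pt := range pod.Spec.Tolerations {
			if pt.Key == t.Key && pt.Operator == t.Operator && pt.Value == t.Value && pt.Effect == t.Effect {
				duplicate = true
				break
			}
		}
		if !duplicate {
			pod.Spec.Tolerations = append(pod.Spec.Tolerations, t)
		}
	}
}
```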

@tallclair
Member Author

@yastij WDYT? Does that alleviate your concerns?

@yastij
Member

yastij commented Apr 1, 2019

@tallclair - sgtm

@yastij
Member

yastij commented Apr 1, 2019

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2019
@egernst egernst left a comment

lgtm

@bsalamat
Member

bsalamat commented Apr 2, 2019

@tallclair
@egernst has a good point. What will happen if a RuntimeClass is deleted after pod creation? I would like to clarify the behavior of the scheduler for pods that are created but not scheduled yet. I am not concerned about already running pods.
When the scheduler sees a pod with a non-existing RuntimeClass, what should it do? I think it is reasonable to assume that no RuntimeClass is specified and always pass the predicate. Another obvious option is to always fail the RuntimeClass predicate and never schedule the pod. Anyway, I think this should be specified in the KEP.

@tallclair
Member Author

Hmm, good point. I wasn't thinking about the case that the RuntimeClass was missing before the pod is scheduled...

Another obvious option is to always fail the RuntimeClass predicate and never schedule the pod.

I think this makes the most sense, otherwise we're just delaying the failure until it lands on a node. If the predicate fails, the scheduler will retry, right?

@yastij
Member

yastij commented Apr 2, 2019

I think this makes the most sense, otherwise we're just delaying the failure until it lands on a node. If the predicate fails, the scheduler will retry, right?

I agree - here we'd just be delaying the failure. The predicate should fail and provide a meaningful log message/reason.
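
A small sketch of the behavior being agreed on here, assuming a hypothetical existence check against a RuntimeClass lister (the real predicate plumbing differs):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// checkRuntimeClassExists fails the fit check when the pod references a
// RuntimeClass that no longer exists, so the scheduler keeps retrying (with
// backoff) and reports a clear "unschedulable" reason, rather than letting the
// pod land on a node and fail there.
func checkRuntimeClassExists(
	pod *corev1.Pod,
	exists func(name string) (bool, error), // hypothetical RuntimeClass lister lookup
) (fits bool, reason string, err error) {
	if pod.Spec.RuntimeClassName == nil {
		return true, "", nil // no RuntimeClass: nothing to enforce
	}
	found, err := exists(*pod.Spec.RuntimeClassName)
	if err != nil {
		return false, "", err // transient lookup error; the scheduler will retry
	}
	if !found {
		return false, "RuntimeClass " + *pod.Spec.RuntimeClassName + " not found", nil
	}
	// The node selector / toleration checks from the earlier sketch would follow here.
	return true, "", nil
}
```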

@bsalamat
Member

bsalamat commented Apr 2, 2019

I think this makes the most sense, otherwise we're just delaying the failure until it lands on a node. If the predicate fails, the scheduler will retry, right?

That's right. The scheduler will keep retrying (with some backoff time).

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 4, 2019
Member

@bsalamat bsalamat left a comment

/lgtm
/approve

Thanks, @tallclair!

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Apr 4, 2019
@tallclair tallclair changed the title RFC: [RuntimeClass scheduling] native scheduler support [RuntimeClass scheduling] native scheduler support, ready to implement Apr 4, 2019
@tallclair
Member Author

  • Added failure modes for the scheduler & admission controller
  • Renamed RuntimeController to RuntimeClass (admission controller)
  • Marked KEP as implementable - I am happy with its current state, and think we should commence coding.

@tallclair
Member Author

Squashed commits.

@derekwaynecarr
Member

/approve

we can clarify the admission requirement in a follow-on or during documentation.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 4, 2019
@tallclair tallclair force-pushed the runtimeclass-scheduling branch from e99bc99 to 152d8a8 Compare April 4, 2019 18:39
@yastij
Member

yastij commented Apr 4, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 4, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, derekwaynecarr, tallclair, yastij

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 246c106 into kubernetes:master Apr 4, 2019