[RuntimeClass scheduling] native scheduler support, ready to implement #909
Conversation
scheduling is to merge the topology constraints from the RuntimeClass into the
PodSpec. Eventually, the controller's responsibilities may grow, such as to
merge in [pod overhead][] or validate feature compatibility.

A new scheduler predicate will manage the RuntimeClass topology. It will look up
Maybe we should discuss predicate composition vs. adding RuntimeClass support to the existing predicates.
If we add it to existing predicates, we don't get a useful error message when the pod can't be scheduled, right?
Yes
@tallclair - if we merge tolerations into the podSpec, how can we inform the user correctly (UX) when no fitting node is found?
I don't think that changes anything, as the tolerations would be resolved by the same PodToleratesNodeTaints predicate? But I might be misunderstanding how the debug messages are created.
Do you mean that we'd still check the runtimeClass when computing podToleratesNodeTaints?
I think we can move forward, and discuss this while implementing
#### Mix-in

The RuntimeClass topology is merged with the pod's NodeSelector & Tolerations.
The `PodToleratesNodeTaints` predicate will also be extended to be
Can't we just call PodToleratesNodeTaints from the RuntimeClassPredicate with the right parameters? Composing seems easier than modifying each predicate to include RuntimeClass awareness.
I assume predicates are AND'd together, in which case the PodToleratesNodeTaints predicate would also be called without the RuntimeClass tolerations, and in that instance fail? I'm not very familiar with the scheduler code though, so I could be totally wrong.
This raises a broader issue at the predicate level, when a predicate is basically a composition of others:
- do we add domain awareness to the existing predicates?
- do we compose these predicates into a domain predicate?

This might over-complicate our predicate definitions, and create performance overhead in some cases.
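To make the composition option concrete, here is a minimal Go sketch of a dedicated RuntimeClass predicate that composes the taint-toleration logic rather than modifying `PodToleratesNodeTaints`. All names (`runtimeClassConstraints`, `podFitsRuntimeClass`) are illustrative, not the actual scheduler API, and the node selector is simplified to a label map:

```go
package scheduling

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// runtimeClassConstraints stands in for the scheduling fields proposed for
// RuntimeClass; the real API shape is still under discussion in this thread.
type runtimeClassConstraints struct {
	NodeSelector map[string]string
	Tolerations  []corev1.Toleration
}

// podFitsRuntimeClass composes the existing taint-toleration check instead of
// teaching PodToleratesNodeTaints about RuntimeClass. Returning a distinct
// failure reason is what enables a useful "doesn't match RuntimeClass" message.
func podFitsRuntimeClass(rc *runtimeClassConstraints, pod *corev1.Pod, node *corev1.Node) (bool, string) {
	if rc == nil {
		return true, "" // no RuntimeClass constraints to enforce
	}
	// Check the RuntimeClass node selector against the node's labels.
	for k, v := range rc.NodeSelector {
		if node.Labels[k] != v {
			return false, fmt.Sprintf("node doesn't match RuntimeClass node selector %s=%s", k, v)
		}
	}
	// Re-run the taint check with the pod's tolerations plus the
	// RuntimeClass tolerations merged in.
	merged := append(append([]corev1.Toleration{}, pod.Spec.Tolerations...), rc.Tolerations...)
	for i := range node.Spec.Taints {
		taint := &node.Spec.Taints[i]
		if taint.Effect != corev1.TaintEffectNoSchedule && taint.Effect != corev1.TaintEffectNoExecute {
			continue
		}
		tolerated := false
		for j := range merged {
			if merged[j].ToleratesTaint(taint) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false, fmt.Sprintf("node has untolerated taint %s", taint.Key)
		}
	}
	return true, ""
}
```

Note the concern raised above still applies: if predicates are AND'd, the stock PodToleratesNodeTaints would also run without the merged tolerations, which is part of what motivates merging tolerations at admission time instead (see the discussion further down).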
Summarizing feedback from 2019-03-26 sig-node:
/assign @egernst
@tallclair - I think this has 2 main benefits
my only concern is on the UX side as @bsalamat stated
  // +optional
- NodeSelectorTerm []corev1.NodeSelectorRequirement
+ NodeSelector corev1.NodeSelector

  // tolerations adds tolerations to pods running with this RuntimeClass.
I am a bit unsure about having tolerations as a part of "topology" in RuntimeClass. This is useful if one doesn't want to provide RuntimeClass on all their pods. If they provide RuntimeClass for all their pods they can just use NodeSelector. If they insist on tainting some nodes, they can use regular Toleration (which is not a part of RuntimeClass). So, I guess my question is "what is the major benefit of having tolerations" here?
I think this is covered under the Tolerations discussion on line 170:
Tolerations

While NodeSelectors and labels are used for steering pods towards nodes,
[taints and tolerations][taint-and-toleration] are used for steering pods away
from nodes. If every pod had a RuntimeClass and every RuntimeClass had a strict
NodeSelector, then RuntimeClasses could use non-overlapping selectors in place
of taints & tolerations. However the same could be said of regular pod
selectors, yet taints & tolerations are still a useful addition. Examples of
use cases for including tolerations in RuntimeClass topology include:

- Tainting Windows nodes with `windows.microsoft.com` to keep default linux
  pods away from the nodes. Windows RuntimeClasses would then need a
  corresponding toleration.
- Tainting "sandbox" nodes with `sandboxed.kubernetes.io` to keep services
  providing privileged node features away from sandboxed workloads. Sandboxed
  RuntimeClasses would need a toleration to enable them to be run on those
  nodes.
I agree that it's not strictly necessary, but I think it will be useful in practice. I guess both of the use cases listed could be accomplished with a default RuntimeClass that looked like this:
```yaml
kind: RuntimeClass
metadata:
  name: default
handler: runc
nodeSelector:
  nodeSelectorTerms:
  - matchExpressions:
    - key: "windows.microsoft.com"
      operator: DoesNotExist
    - key: "sandboxed.kubernetes.io"
      operator: DoesNotExist
```
Does this assume that the toleration key is also set as a label? I think that this isn't always the case.
This implies using labels rather than taints.
It is more straightforward for users to implement dedicated nodes using taints and tolerations rather than nodeSelectors.
@bsalamat - The benefit of having tolerations on the RuntimeClass is freeing users who want to consume a specific RuntimeClass available on a dedicated node from having to think about which tolerations they should set.
Your points are valid. Thanks for clarifying.
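For reference, here is a hedged Go sketch of the RuntimeClass scheduling fields this thread has been debating. The field set follows the diff excerpt above, but the struct and field names are illustrative, not the final API:

```go
package node

import corev1 "k8s.io/api/core/v1"

// Scheduling collects the topology constraints a RuntimeClass imposes on the
// pods that use it (name is illustrative; the KEP also calls this "topology").
type Scheduling struct {
	// nodeSelector selects the nodes that support this RuntimeClass.
	// +optional
	NodeSelector corev1.NodeSelector

	// tolerations adds tolerations to pods running with this RuntimeClass.
	// +optional
	Tolerations []corev1.Toleration
}
```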
Integrated feedback from sig-node. In particular, the RuntimeController was added back, and tolerations are back to being a mixin.
@@ -203,6 +202,36 @@ will be a clear explanation of why. For example:

    0/10 Nodes are available: 5 nodes do not have enough memory, 5 nodes don't match RuntimeClass

### RuntimeController |
nit: I would call this Runtime Admission Controller to clarify that this is not a controller with a control loop that runs all the time and monitors things in the cluster.
I agree
Looks like the convention of the admission plugins is to just omit the controller piece altogether - so this would just be "Runtime". I'll clarify the name in the KEP though.
or rename to RuntimeTopology?
The plan is to reuse the controller for other RuntimeClass-related admission controls. In particular PodOverhead. So I don't think it makes sense to include topology in the name.
agree with a single admission plugin for a common feature.
@tallclair - have a suggested controller name in mind? ContainerRuntime? Runtime is pretty overloaded...
I went with RuntimeClass for the controller name. Pretty self-explanatory.
/lgtm just a rename or clarification on the name of the admission controller is missing. @bsalamat - any thoughts on merging tolerations into the podSpec? This would help enforcing
Are you saying that instead of having "toleration" in the RuntimeClass, we use the same toleration of the PodSpec? If so, we discussed it in this KEP and decided that the existence of toleration in the runtime class is useful.
@tallclair I'd prefer it to be separate so that we can provide a better "unschedulable" message to users.
@bsalamat - I agree
I think we can keep them separate for now. If we see that we need to add tolerations to the podSpec we can do it afterwards.
@tallclair reached out and asked that I review. I am fine with what is presented.
I spoke with @bsalamat offline, and we agreed that it makes sense to merge the tolerations into the pod spec. If the user experience is the same, merging the tolerations is much simpler (for the reasons outlined above, around handling RC deletion, etc.), and also makes it easier to apply toleration restrictions through policies (e.g. SchedulingPolicy, gatekeeper).
@yastij WDYT? Does that alleviate your concerns?
@tallclair - sgtm
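As a rough illustration of the agreed direction, the RuntimeClass admission plugin would append the class's tolerations to the pod spec. `mergeTolerations` is a hypothetical helper (the real plugin code differs), though `corev1.Toleration.MatchToleration` is a real method:

```go
package admission

import (
	corev1 "k8s.io/api/core/v1"
)

// mergeTolerations appends each RuntimeClass toleration the pod does not
// already carry, keeping re-admission idempotent.
func mergeTolerations(pod *corev1.Pod, rcTolerations []corev1.Toleration) {
	for i := range rcTolerations {
		t := rcTolerations[i]
		exists := false
		for j := range pod.Spec.Tolerations {
			if pod.Spec.Tolerations[j].MatchToleration(&t) {
				exists = true
				break
			}
		}
		if !exists {
			pod.Spec.Tolerations = append(pod.Spec.Tolerations, t)
		}
	}
}
```

Once merged, the stock PodToleratesNodeTaints predicate sees the combined tolerations, so no taint-related predicate changes are needed; only the NodeSelector check stays native to the scheduler.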
/hold cancel
lgtm
@tallclair
Hmm, good point. I wasn't thinking about the case where the RuntimeClass is missing before the pod is scheduled...
I think this makes the most sense, otherwise we're just delaying the failure until it lands on a node. If the predicate fails, the scheduler will retry, right?
I agree, here we'll just delay the failure; the predicate should fail and provide a meaningful log/reason.
That's right. The scheduler will keep retrying (with some backoff time).
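Building on the earlier predicate sketch (same illustrative types; not the actual scheduler code), the missing-class case would surface as a plain predicate failure, which the scheduler retries with backoff:

```go
// podFitsRuntimeClassName fails the predicate (rather than returning an error)
// when the pod references a RuntimeClass that doesn't exist yet, so the pod
// stays pending with a clear reason until the class is created.
func podFitsRuntimeClassName(className string, classes map[string]*runtimeClassConstraints, pod *corev1.Pod, node *corev1.Node) (bool, string) {
	rc, ok := classes[className]
	if !ok {
		return false, fmt.Sprintf("RuntimeClass %q not found", className)
	}
	return podFitsRuntimeClass(rc, pod, node)
}
```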
/lgtm
/approve
Thanks, @tallclair!
Force-pushed from 7d9e6d2 to e99bc99.
Squashed commits.
/approve we can clarify the admission requirement in a follow-on or during documentation.
Force-pushed from e99bc99 to 152d8a8.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bsalamat, derekwaynecarr, tallclair, yastij.
Updated the RuntimeClass scheduling KEP to use native scheduler support rather than a mixin approach.
This approach uses a cleaner API (doesn't need to mix in a bunch of rules into the pod's node affinity), it lets us use the more expressive NodeSelector API, and provides a nice user experience in the error case. I am opening this PR to continue the discussion @bsalamat started here.
/sig node scheduling
/assign @bsalamat @derekwaynecarr @dchen1107
/cc @yastij
/milestone v1.15
/priority important-longterm