KEP-4671: Introduce Workload Scheduling Cycle #136618
k8s-ci-robot merged 2 commits into kubernetes:master
Conversation
Skipping CI for Draft Pull Request.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: macsko. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
/hold
/cc
Force-pushed 9c3e09f to b624834 (Compare)
/test all
Force-pushed b624834 to 83c5c68 (Compare)
/test all
Overall the PR looks good to me. Considering that many PRs are blocked on this one and how hard it is to keep rebasing them, let's treat my comments as optional ones that can be applied in follow-up PRs (we need them anyway). I still need to finish the tests review today before the final LGTM. Regarding keeping the cycle state in
dom4ha
left a comment
I finished the tests review and, besides some small comments, it's still LGTM.
```go
	expectedUsedPVCSet: sets.New("test-ns/test-pvc1", "test-ns/test-pvc2"),
},
{
	name: "Assume and forget in cache, and in snapshot",
```
Can we have an additional test which assumes in the snapshot pods that include PVCs and PodAffinity? It could be the same as this one, but with pods that have a PVC and affinity.
```go
	createPods: []*v1.Pod{p1, otherP1, p2, otherP2, p3, otherP3},
},
{
	name: "Verify the entire gang is now scheduled",
```
IIUC this test is deterministic because of the order in which pods are created, and the timestamp recorded in the active queue has sufficient precision (we never have two pods with the same timestamp).
Can we have exactly the same test but with the order of pod creation reversed, to prove there is no other constraint driving the scheduler's decision?
> IIUC this test is deterministic because of the order in which pods are created, and the timestamp recorded in the active queue has sufficient precision (we never have two pods with the same timestamp).

Correct, the timestamp is the main factor. Note that such a test would be flaky in a real cluster because the order in which the scheduler processes the pods does not have to reflect the order in which the pods are created (in the scheduling queue, the timestamp is the time when the pod entered the queue). In integration tests, we can safely assume that the event handlers will observe the pods in the same order that they were created.

> Can we have exactly the same test but with the order of pod creation reversed, to prove there is no other constraint driving the scheduler's decision?

Good idea, added.
```go
// This function is not thread safe, so it should be executed when no other routines can write/read from the snapshot.
func (s *Snapshot) forgetAllAssumedPods(logger klog.Logger) {
	for _, pod := range s.assumedPods {
		err := s.ForgetPod(logger, pod)
```
Since this method is a safety check only, can you log an error here whenever there was any pod to forget?
```go
		metrics.PodGroupUnschedulable(schedFwk.ProfileName(), metrics.SinceInSeconds(start))
	case podGroupWaitingOnPreemption:
		logger.V(2).Info("Pod group is waiting for preemption", "podGroup", klog.KObj(podGroupInfo), "unschedulablePods", unschedulablePods)
		metrics.PodGroupUnschedulable(schedFwk.ProfileName(), metrics.SinceInSeconds(start))
```
Why not report the metric under "waiting_on_preemption"?

For pods we only return unschedulable, even if preemption was initiated. I wanted to clone the same states, but maybe having a separate one for waiting on preemption is a good idea. Added.
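For reference, the change described here ("Added") would look roughly like the fragment below; PodGroupWaitingOnPreemption is a hypothetical metrics helper mirroring the existing PodGroupUnschedulable one, not necessarily the name used in the follow-up:

```go
	case podGroupWaitingOnPreemption:
		logger.V(2).Info("Pod group is waiting for preemption", "podGroup", klog.KObj(podGroupInfo), "unschedulablePods", unschedulablePods)
		// Record the group under a dedicated "waiting_on_preemption" status
		// instead of reusing the "unschedulable" one (hypothetical helper).
		metrics.PodGroupWaitingOnPreemption(schedFwk.ProfileName(), metrics.SinceInSeconds(start))
```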
- Add integration tests for gang and basic policy workload scheduling
- Add more tests for cluster snapshot
- Proceed to binding cycle just after pod group cycle
- Enforce one scheduler name per pod group, rename workload cycle to pod group cycle
- Add unit tests for pod group scheduling cycle
- Run ScheduleOne tests treating pod as part of a pod group
- Rename NeedsPodGroupCycle to NeedsPodGroupScheduling
- Observe correct per-pod and per-podgroup metrics during pod group cycle
- Rename pod group algorithm status to waiting_on_preemption
- Mention forgetAllAssumedPods is a safety check
Force-pushed 61fcdc7 to 6233b25 (Compare)
/lgtm
Looks great!
LGTM label has been added.
Git tree hash: 0be4b06f2df66360ad2e3c3b84db0a0f9e00e7bd
/assign
I'm seeing regular panics / integration test failures on TestPodGroupScheduling since this merged.
examples at
sanposhiho
left a comment
Great work crafting a new fundamental layer.
I left comments; please open a follow-up PR as we discussed.
```go
@@ -0,0 +1,466 @@
/*
Copyright The Kubernetes Authors.
```
Copyright headers should no longer include the year; see kubernetes/hack/boilerplate/boilerplate.py, line 191 at 0cf70d1.
```go
		sched.SchedulingQueue.Done(podInfo.Pod.UID)
		return
	}
	sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
```
Is it desirable to requeue with an Error status?

Feeling like we should requeue it somehow so it gets requeued on a Pod update event. Or maybe in PreEnqueue.

Ideally, such pod groups would be blocked on a PodGroup-level PreEnqueue. This could be done partially when PodGroup queueing is implemented, so I will defer to that time. For now, I think having an error is more or less okay.
```go
		sched.SchedulingQueue.Done(podInfo.Pod.UID)
		return
	}
	sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
```
Bug: err is always nil here. I guess you meant to use the err from L48.
```go
podGroupInfo, err := sched.podGroupInfoForPod(ctx, podInfo)
if err != nil {
	podFwk, err := sched.frameworkForPod(podInfo.Pod)
	if err != nil {
		// This shouldn't happen, because we only accept for scheduling the pods
		// which specify a scheduler name that matches one of the profiles.
		klog.FromContext(ctx).Error(err, "Error occurred")
		sched.SchedulingQueue.Done(podInfo.Pod.UID)
		return
	}
	sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
	return
}
```
Why do it twice (here and in scheduleOnePodGroup)?

Do you mean the part with frameworkForPod and FailureHandler? It's there to correctly return the failed pod to the queue. Here it's called when podGroupInfoForPod fails (this is a temporary function until we have pod group queueing). In podGroupInfoForPod it's called when frameworkForPodGroup fails (which should also be optimized with pod group queueing). So I think both are used here only temporarily.

nvm, I misread the code; I thought we had two duplicated things doing basically the same thing.
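For the err-shadowing bug flagged above, a minimal sketch of the fix, assuming the intent is to pass the outer err from podGroupInfoForPod to the FailureHandler (the fwkErr rename is a hypothetical name that avoids the shadowing):

```go
podGroupInfo, err := sched.podGroupInfoForPod(ctx, podInfo)
if err != nil {
	// Name the framework-lookup error differently so the outer err is
	// still visible when building the failure status below.
	podFwk, fwkErr := sched.frameworkForPod(podInfo.Pod)
	if fwkErr != nil {
		klog.FromContext(ctx).Error(fwkErr, "Error occurred")
		sched.SchedulingQueue.Done(podInfo.Pod.UID)
		return
	}
	sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
	return
}
```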
```go
// Synchronously attempt to find a fit for the pod group.
start := time.Now()
```
Why not generate start within podGroupCycle? Why does podGroupCycle have to get this as an argument?
```go
} else {
	pInfo = p.unschedulablePods.get(pod)
	if pInfo != nil {
		if pInfo.Gated() {
```
Wondering, should we run PreEnqueue here? I am kinda concerned about a race condition here, like:
- The last pod of the gang is added, and this pod is added to activeQ straightaway because it is the last pod.
- The last pod is popped immediately.
- The other pods are not ungated yet (the pop was faster than the queue handling the pod add event and triggering the QHint/PreEnqueue for the other pods).
- Because the other pods are not ungated, the workload scheduling cycle can only handle the last pod that was just added to the scheduler.

PreEnqueue is supposed to be very lightweight, so I don't have a perf concern about running it again when the pod is gated.
This is a temporary function that should be replaced soon by the proper pod group queueing, so this concern shouldn't reach the 1.36 release. With pod group queueing, the pod group will be queueable as long as the PreEnqueue check passes for all of its member pods.
```go
	}
	podGroupInfo.QueuedPodInfos = append(podGroupInfo.QueuedPodInfos, unscheduledPodInfo)
}
// Sort the pods in deterministic order. First by priority, then by their InitialAttemptTimestamp.
```
Should we actually add some randomness spice, for heterogeneous workloads?

According to the KEP discussion, we want to make the algorithm deterministic for now, and heterogeneous workloads won't be well supported by the scheduler anyway. We can consider introducing non-determinism, but likely not in v1.36.
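As a self-contained illustration of the deterministic ordering described in the diff comment above (simplified stand-in types, not the scheduler's actual QueuedPodInfo):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// queuedPod is a simplified stand-in for the scheduler's queued pod info.
type queuedPod struct {
	name                    string
	priority                int32
	initialAttemptTimestamp time.Time
}

func main() {
	now := time.Now()
	pods := []queuedPod{
		{"low-late", 0, now.Add(2 * time.Second)},
		{"high", 10, now.Add(3 * time.Second)},
		{"low-early", 0, now.Add(1 * time.Second)},
	}
	// First by priority (higher first), then by InitialAttemptTimestamp
	// (earlier first), mirroring the comment in the diff.
	sort.Slice(pods, func(i, j int) bool {
		if pods[i].priority != pods[j].priority {
			return pods[i].priority > pods[j].priority
		}
		return pods[i].initialAttemptTimestamp.Before(pods[j].initialAttemptTimestamp)
	})
	for _, p := range pods {
		fmt.Println(p.name) // high, low-early, low-late
	}
}
```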
```go
if podResult.permitStatus.IsSuccess() {
	// When the permit returns success for any pod, the pod group is schedulable.
	if requiresPreemption {
		// If any preemption is required, the whole pod group requires it to be feasible.
```
Why? That depends on how many pods need to be schedulable (gang), right? Or, if it is basic, we actually do not need to require preemption in any case.
Good point. After thinking about it, I came up with a few scenarios:
- With workload-aware preemption, it's not a problem because, if possible, pods will be scheduled without preemption first.
- Without workload-aware preemption, if a pod group is not feasible and a pod requires preemption, the whole pod group requires preemption, as the latter pods may require the preemptor to be scheduled or may use the freed-up space.
- When a pod group is feasible but some additional pods require preemption, we could consider binding the previous pods and putting the rest back in the queue (obviously trying to schedule them first), for the same reason as in the point above.

Therefore, we could only optimize the third scenario, which would be a temporary solution anyway, overwritten by workload-aware preemption.
Ok, I have some other thoughts around preemption, but apparently we should discuss them in the workload-aware preemption implementation KEP, as the current code is kinda temporary and full of 'will be changed after workload-aware preemption'.
```go
		}
		go sched.runBindingCycle(ctx, podCtx.state, schedFwk, podResult.scheduleResult, assumedPodInfo, podSchedulingStart, podCtx.podsToActivate)
		scheduledPods++
	case podGroupUnschedulable:
```
I am confused; can this happen? I.e., this pod is schedulable but the whole group is unschedulable. Looking at podGroupSchedulingDefaultAlgorithm, I thought that if a pod is schedulable, it changes the group status to feasible or waiting on preemption.

Yes, it can happen when the schedulable pods count is less than minCount. Note that we check here the podResult.status, which is Success, not podResult.permitStatus, which is Wait.

> Note that we check here the podResult.status, which is Success, not podResult.permitStatus, which is Wait.

Nice guess on how I misread and got confused.
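To make the minCount interaction concrete, here is a hedged sketch of the decision (hypothetical names and status strings; the actual logic lives in podGroupSchedulingDefaultAlgorithm and is more involved):

```go
// groupStatus shows how an individual pod can be schedulable (its
// podResult.status is Success) while the group as a whole ends up
// unschedulable: a gang needs at least minCount schedulable pods.
func groupStatus(schedulablePods, minCount int, anyRequiresPreemption bool) string {
	if schedulablePods >= minCount {
		return "feasible"
	}
	if anyRequiresPreemption {
		// Per the earlier preemption discussion: if the group is not yet
		// feasible and any pod needed preemption, the whole group waits
		// on that preemption.
		return "waiting_on_preemption"
	}
	return "unschedulable"
}
```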
```go
// running the remaining plugins and returns an error. If any of the
// plugins returns "Wait", this function will NOT create a waiting pod object,
// but just return status with "Wait" code. It's caller's responsibility to act on that code.
func (f *frameworkImpl) RunPermitPluginsWithoutWaiting(ctx context.Context, state fwk.CycleState, pod *v1.Pod, nodeName string) (status *fwk.Status) {
```
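For context, the caller is expected to act on the returned "Wait" code itself; a hedged sketch of such a call site (the handling shown is hypothetical, not the PR's actual caller):

```go
status := f.RunPermitPluginsWithoutWaiting(ctx, state, pod, nodeName)
switch {
case status.IsSuccess():
	// All permit plugins approved; the pod can proceed toward binding.
case status.Code() == fwk.Wait:
	// No waiting-pod object was created; the pod group cycle must track
	// the wait itself and decide later whether the group can proceed.
default:
	// Rejected or errored; treat the pod as unschedulable/failed.
}
```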
+1 to @dom4ha, I don't like this code. We should have one version only: either we somehow use the waiting pod for pod groups too, or we always create the waiting pod outside of this function.

The fix got merged today: #137194
```go
// ForgetPod forgets a given pod from the snapshot.
// This function is not thread safe, so it should be executed when no other routines can write/read from the snapshot.
func (s *Snapshot) ForgetPod(logger klog.Logger, pod *v1.Pod) error {
```
Is it efficient to forget all pods one by one with this func? Can we keep the old nodeInfoMap at the beginning of the workload scheduling cycle and revert the cache by restoring it afterwards?

I think cloning the entire nodeInfoMap can be much more time-consuming in large clusters. Given that for TAS we would need to revert it per placement, it would be even worse. If the current forgetting logic proves to be inefficient, we can consider having a delta storage or something else for the snapshot.

We will have to handle this problem differently once we decide to perform TAS or WAS preemption scheduling in parallel. We will have to address the problem at that time, but for now it sounds sufficient.

> We will have to handle this problem differently once we decide to perform TAS or WAS preemption scheduling in parallel.

Ok, that convinced me; let's not worry about my point now.
```go
			utilruntime.HandleErrorWithLogger(logger, err, "Failed to forget assumed pod")
		}
	}
	utilruntime.HandleErrorWithLogger(logger, nil, "Found assumed pods in the snapshot that were not forgotten", "assumedPodsCount", len(s.assumedPods))
```
Nit: does it make sense to use HandleErrorWithLogger? That adds unnecessary backoff wait time. Though I know ideally this error message isn't shipped at all in a prod k8s, the wait time would just slow down scheduling with no benefit.
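A sketch of the suggested alternative, assuming plain logging is acceptable for this safety check (logr's Logger.Error accepts a nil error):

```go
if len(s.assumedPods) > 0 {
	// Log directly rather than via utilruntime.HandleErrorWithLogger,
	// which adds a backoff sleep that would only slow scheduling down here.
	logger.Error(nil, "Found assumed pods in the snapshot that were not forgotten", "assumedPodsCount", len(s.assumedPods))
}
```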
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds a pod group scheduling cycle to the scheduler's main loop flow.
TBD in the next PR: add information to the pod status when a pod uses inter-pod dependencies or the group is non-homogeneous.
Which issue(s) this PR is related to:
Special notes for your reviewer:
This PR doesn't introduce workload-awareness to the scheduling queue; another PR will be responsible for that. In this PR, the workload cycle is initiated when any pod from a pod group is popped from the queue and hasn't yet been processed by this cycle.
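A minimal sketch of that flow, with hypothetical wiring (NeedsPodGroupScheduling comes from the commit list above; the surrounding function shape is illustrative only, not the PR's actual scheduleOne signature):

```go
func (sched *Scheduler) scheduleOne(ctx context.Context) {
	podInfo := sched.NextPod()
	// If the popped pod belongs to a pod group that hasn't been handled by
	// the pod group cycle yet, schedule the whole group at once.
	if podInfo.NeedsPodGroupScheduling() {
		sched.scheduleOnePodGroup(ctx, podInfo)
		return
	}
	// Otherwise fall back to the regular per-pod scheduling cycle.
	sched.schedulePod(ctx, podInfo)
}
```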
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: