KEP-4671: Introduce Workload Scheduling Cycle#136618

Merged
k8s-ci-robot merged 2 commits into kubernetes:master from macsko:workload_scheduling_cycle on Feb 17, 2026

Conversation

@macsko
Member

@macsko macsko commented Jan 29, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a pod group scheduling cycle to the scheduler's main loop.

TBD in a follow-up PR: add information to the pod status when a pod uses inter-pod dependencies or the group is non-homogeneous.

Which issue(s) this PR is related to:

Special notes for your reviewer:

This PR doesn't introduce workload-awareness to the scheduling queue; another PR will be responsible for that. In this PR, the workload cycle is initiated when any pod from a pod group is popped from the queue and hasn't yet been processed by this cycle.
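For illustration, a minimal sketch of that dispatch with pared-down stand-in types; NeedsPodGroupScheduling appears in the PR's commit messages, while every other name below is hypothetical:

package sketch

import "context"

// queuedPodInfo and scheduler are pared-down stand-ins for the real types.
type queuedPodInfo struct {
    needsPodGroupScheduling bool // whether the pod belongs to a pod group
}

type scheduler struct {
    handledByGroupCycle map[*queuedPodInfo]bool // hypothetical bookkeeping
}

// scheduleOne sketches the dispatch: a pod-group member that the group cycle
// hasn't processed yet starts one; everything else takes the single-pod path.
func (s *scheduler) scheduleOne(ctx context.Context, pi *queuedPodInfo) {
    if pi.needsPodGroupScheduling && !s.handledByGroupCycle[pi] {
        s.scheduleOnePodGroup(ctx, pi) // attempt to place the whole group at once
        return
    }
    s.schedulePod(ctx, pi) // regular single-pod scheduling cycle
}

func (s *scheduler) scheduleOnePodGroup(ctx context.Context, pi *queuedPodInfo) {}
func (s *scheduler) schedulePod(ctx context.Context, pi *queuedPodInfo)         {}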

Does this PR introduce a user-facing change?

Introduced a PodGroup scheduling cycle in kube-scheduler to schedule an entire PodGroup in one cycle.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-scheduling/4671-gang-scheduling/README.md

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 29, 2026
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jan 29, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: macsko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/test sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jan 29, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 29, 2026
@macsko
Member Author

macsko commented Jan 29, 2026

/hold
For review

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2026
@tosi3k
Member

tosi3k commented Jan 29, 2026

/cc

@k8s-ci-robot k8s-ci-robot requested a review from tosi3k January 29, 2026 08:50
@macsko macsko force-pushed the workload_scheduling_cycle branch 2 times, most recently from 9c3e09f to b624834 on January 29, 2026 09:39
@macsko
Member Author

macsko commented Jan 30, 2026

/test all

@macsko macsko force-pushed the workload_scheduling_cycle branch from b624834 to 83c5c68 on January 30, 2026 15:17
@macsko
Member Author

macsko commented Feb 2, 2026

/test all

Comment thread pkg/scheduler/framework/types.go Outdated
@k8s-ci-robot k8s-ci-robot added the area/workload-aware Categorizes an issue or PR as relevant to Workload-aware and Topology-aware scheduling subprojects. label Feb 15, 2026
brejman added a commit to brejman/kubernetes that referenced this pull request Feb 16, 2026
@dom4ha
Member

dom4ha commented Feb 16, 2026

Overall the PR looks good to me.

Considering that many PRs are blocked on this one and how hard it is to keep rebasing them, let's treat my comments as optional ones that can be applied in follow-up PRs (we need them anyway). I still need to finish the tests review today before the final LGTM.

Regarding keeping the cycle state in algorithmResults: it seems sufficient for now, until we come up with an idea for keeping only the essential information (node nomination + plugin-specific decisions like DRA allocation) that would be sufficient to run Reserve. It seems to me like separate work that we may want to do independently from this PR.

Member

@dom4ha dom4ha left a comment

I finished the tests review and, besides some small comments, it's still LGTM.

expectedUsedPVCSet: sets.New("test-ns/test-pvc1", "test-ns/test-pvc2"),
},
{
name: "Assume and forget in cache, and in snapshot",
Member

Can we have an additional test which assumes in the snapshot and includes PVCs and PodAffinity? It could be the same as this one, but with pods that have PVCs and affinity.

Member Author

Added

createPods: []*v1.Pod{p1, otherP1, p2, otherP2, p3, otherP3},
},
{
name: "Verify the entire gang is now scheduled",
Member

IIUC this test is deterministic because of the order in which pods are created, and the timestamp recorded in the active queue has sufficient precision (we never have two pods with the same timestamp).

Can we have exactly the same test but with the order of pod creation reversed, to prove there is no other constraint driving the scheduler's decision?

Member Author

IIUC this test is deterministic because of the order in which pods are created, and the timestamp recorded in the active queue has sufficient precision (we never have two pods with the same timestamp).

Correct, the timestamp is the main factor. Note that such a test would be flaky in a real cluster because the order in which the scheduler processes the pods does not have to reflect the order in which the pods are created (in the scheduling queue, the timestamp is the time when the pod entered the queue). In integration tests, we can safely assume that the event handlers will observe the pods in the same order in which they were created.

Can we have exactly the same test but with the order of pod creation reversed, to prove there is no other constraint driving the scheduler's decision?

Good idea, added
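As background for the determinism point: the default queue ordering (the PrioritySort queueing plugin) compares pod priority first and falls back to the queue-entry timestamp. A simplified sketch with a pared-down stand-in for QueuedPodInfo:

package sketch

import (
    "time"

    v1 "k8s.io/api/core/v1"
    corev1helpers "k8s.io/component-helpers/scheduling/corev1"
)

// queuedPod is a stand-in for the queue entry: the pod plus the time it
// entered the scheduling queue (not its creation time).
type queuedPod struct {
    Pod       *v1.Pod
    Timestamp time.Time
}

// less mirrors the default activeQ ordering: higher priority first, with the
// queue-entry timestamp breaking ties. In practice two pods never share a
// timestamp, which is what makes the integration test deterministic.
func less(p1, p2 *queuedPod) bool {
    prio1 := corev1helpers.PodPriority(p1.Pod)
    prio2 := corev1helpers.PodPriority(p2.Pod)
    return prio1 > prio2 || (prio1 == prio2 && p1.Timestamp.Before(p2.Timestamp))
}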

Comment thread pkg/scheduler/schedule_one_test.go Outdated
Comment thread pkg/scheduler/schedule_one_test.go Outdated
// This function is not thread safe, so it should be executed when no other routines can write/read from the snapshot.
func (s *Snapshot) forgetAllAssumedPods(logger klog.Logger) {
for _, pod := range s.assumedPods {
err := s.ForgetPod(logger, pod)
Member

Since this method is a safety check only, can you log an error here whenever there was any pod to forget?

Member Author

It's already added
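For reference, a sketch of how the pieces quoted in this thread fit together, with the Snapshot type pared down to what the sketch touches (the merged code may differ in detail):

package sketch

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/klog/v2"
)

// Snapshot is reduced to the single field this sketch uses.
type Snapshot struct {
    assumedPods map[string]*v1.Pod // assumed pods, keyed by UID in this sketch
}

// ForgetPod stands in for the real method that reverts one assumed pod.
func (s *Snapshot) ForgetPod(logger klog.Logger, pod *v1.Pod) error { return nil }

// forgetAllAssumedPods is a safety check only: the pod group cycle should have
// already reverted every in-snapshot assumption, so any leftover pod is logged
// as an error before being forgotten. Not thread safe; run it only when no
// other routines can read or write the snapshot.
func (s *Snapshot) forgetAllAssumedPods(logger klog.Logger) {
    if len(s.assumedPods) == 0 {
        return
    }
    logger.Error(nil, "Found assumed pods in the snapshot that were not forgotten", "assumedPodsCount", len(s.assumedPods))
    for _, pod := range s.assumedPods {
        if err := s.ForgetPod(logger, pod); err != nil {
            logger.Error(err, "Failed to forget assumed pod", "pod", klog.KObj(pod))
        }
    }
}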

Comment thread pkg/scheduler/schedule_one_podgroup_test.go Outdated
Comment thread pkg/scheduler/schedule_one_podgroup_test.go Outdated
Comment thread pkg/scheduler/schedule_one_podgroup.go Outdated
metrics.PodGroupUnschedulable(schedFwk.ProfileName(), metrics.SinceInSeconds(start))
case podGroupWaitingOnPreemption:
logger.V(2).Info("Pod group is waiting for preemption", "podGroup", klog.KObj(podGroupInfo), "unschedulablePods", unschedulablePods)
metrics.PodGroupUnschedulable(schedFwk.ProfileName(), metrics.SinceInSeconds(start))
Member

Why not report the metric under "waiting_on_preemption"?

Member Author

For pods, we only return unschedulable, even if preemption was initiated. I wanted to mirror the same states, but maybe having a separate one for waiting on preemption is a good idea. Added.
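A sketch of the new branch, written in the style of the hunk above; the metrics helper name is assumed by analogy with PodGroupUnschedulable and may not match the merged code:

case podGroupWaitingOnPreemption:
    logger.V(2).Info("Pod group is waiting for preemption", "podGroup", klog.KObj(podGroupInfo), "unschedulablePods", unschedulablePods)
    // Report a dedicated outcome instead of reusing the unschedulable one.
    metrics.PodGroupWaitingOnPreemption(schedFwk.ProfileName(), metrics.SinceInSeconds(start))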

Add integration tests for gang and basic policy workload scheduling

Add more tests for cluster snapshot

Proceed to binding cycle just after pod group cycle

Enforce one scheduler name per pod group, rename workload cycle to pod group cycle

Add unit tests for pod group scheduling cycle

Run ScheduleOne tests treating pod as part of a pod group

Rename NeedsPodGroupCycle to NeedsPodGroupScheduling

Observe correct per-pod and per-podgroup metrics during pod group cycle

Rename pod group algorithm status to waiting_on_preemption

Mention forgotAllAssumedPods is a safety check
@macsko macsko force-pushed the workload_scheduling_cycle branch from 61fcdc7 to 6233b25 on February 17, 2026 09:04
@dom4ha
Member

dom4ha commented Feb 17, 2026

/lgtm
/unhold

Looks great!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 17, 2026
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0be4b06f2df66360ad2e3c3b84db0a0f9e00e7bd

@k8s-ci-robot k8s-ci-robot merged commit f9c9f03 into kubernetes:master Feb 17, 2026
13 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 17, 2026
@github-project-automation github-project-automation Bot moved this from In Progress to Done in SIG Scheduling Feb 17, 2026
@sanposhiho
Member

/assign

@liggitt
Member

liggitt commented Feb 20, 2026

I'm seeing regular panics / integration test failures on TestPodGroupScheduling since this merged

xref https://storage.googleapis.com/k8s-triage/index.html?text=TestPodGroupScheduling

Member

@sanposhiho sanposhiho left a comment

Great work crafting a new fundamental layer.
I left comments; please open a follow-up PR as we discussed.

@@ -0,0 +1,466 @@
/*
Copyright The Kubernetes Authors.
Member

nit: year?

Member Author

Copyright headers should no longer include the year:

# After 2025, we no longer allow new files to include the year in the copyright header.

sched.SchedulingQueue.Done(podInfo.Pod.UID)
return
}
sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
Member

Is it desirable to requeue with an Error status?

Member

I feel like we should requeue it somehow so it's requeued on a Pod update event. Or maybe PreEnqueue.

Member Author

Ideally, such pod groups would be blocked on a PodGroup-level PreEnqueue. This could be done partially when PodGroup queuing is implemented, so I will defer to that time.

For now, I think having an error is more or less okay.

Member

Ok, makes sense.

sched.SchedulingQueue.Done(podInfo.Pod.UID)
return
}
sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
Member

Bug: err is always nil here. I guess you meant to use the err from L48.

Member Author

Fixed in #137216

Comment on lines +78 to +90
podGroupInfo, err := sched.podGroupInfoForPod(ctx, podInfo)
if err != nil {
podFwk, err := sched.frameworkForPod(podInfo.Pod)
if err != nil {
// This shouldn't happen, because we only accept for scheduling the pods
// which specify a scheduler name that matches one of the profiles.
klog.FromContext(ctx).Error(err, "Error occurred")
sched.SchedulingQueue.Done(podInfo.Pod.UID)
return
}
sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
return
}
Member

Why do it twice (here and in scheduleOnePodGroup)?

Member Author

Do you mean the part with frameworkForPod and FailureHandler? It's there to correctly return the failed pod to the queue. Here it's called when podGroupInfoForPod fails (this is a temporary function until we have pod group queueing). In podGroupInfoForPod, it's called when frameworkForPodGroup fails (this should also be optimized with pod group queueing). So I think both are used here temporarily.

Member

nvm, I misread the code. I thought we had two duplicated things doing basically the same thing.

Comment on lines +73 to +74
// Synchronously attempt to find a fit for the pod group.
start := time.Now()
Member

Why not generate start within podGroupCycle? Why does podGroupCycle have to get it as an arg?

Member Author

Right, changed in #137216

} else {
pInfo = p.unschedulablePods.get(pod)
if pInfo != nil {
if pInfo.Gated() {
Member

Wondering: should we run PreEnqueue here? I'm kinda concerned about a race condition here, like:

  1. The last pod of the gang is added, and this pod is added to activeQ straightaway because it is the last pod.
  2. The last pod is popped immediately.
  3. The other pods are not ungated yet (the pop was faster than the queue handling the pod add event and triggering the QHint/PreEnqueue for the other pods).
  4. Because the other pods are not ungated, the workload scheduling cycle can only handle the last pod that was just added to the scheduler.

PreEnqueue is supposed to be very lightweight, so I don't have a perf concern about running it again when a pod is gated.

Member Author

This is a temporary function that should be replaced soon by proper pod group queueing, so this concern shouldn't reach the 1.36 release. With pod group queueing, the pod group will be queueable as long as the PreEnqueue check passes for all of its member pods.

}
podGroupInfo.QueuedPodInfos = append(podGroupInfo.QueuedPodInfos, unscheduledPodInfo)
}
// Sort the pods in deterministic order. First by priority, then by their InitialAttemptTimestamp.
Member

Should we actually add some randomness here, for heterogeneous workloads?

Member Author

According to the KEP discussion, we want to make the algorithm deterministic for now, and heterogeneous workloads won't be well supported by the scheduler anyway. We could consider adding non-determinism, but likely not in v1.36.
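A minimal sketch of the deterministic ordering described by the quoted comment, using a pared-down stand-in for QueuedPodInfo (field names follow the comment; everything else is assumed):

package sketch

import (
    "sort"
    "time"

    v1 "k8s.io/api/core/v1"
    corev1helpers "k8s.io/component-helpers/scheduling/corev1"
)

// groupMember is a pared-down stand-in for the scheduler's QueuedPodInfo.
type groupMember struct {
    Pod                     *v1.Pod
    InitialAttemptTimestamp time.Time
}

// sortGroupMembers orders the group's pods deterministically: by descending
// priority first, then by the earliest initial attempt timestamp.
func sortGroupMembers(pods []*groupMember) {
    sort.Slice(pods, func(i, j int) bool {
        pi := corev1helpers.PodPriority(pods[i].Pod)
        pj := corev1helpers.PodPriority(pods[j].Pod)
        if pi != pj {
            return pi > pj
        }
        return pods[i].InitialAttemptTimestamp.Before(pods[j].InitialAttemptTimestamp)
    })
}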

if podResult.permitStatus.IsSuccess() {
// When the permit returns success for any pod, the pod group is schedulable.
if requiresPreemption {
// If any preemption is required, the whole pod group requires it to be feasible.
Member

Why? That depends on how many pods need to be schedulable (gang), right? Or, if it is basic, we actually do not need to require preemption in any case.

Member Author

Good point. After thinking about it, I came up with a few scenarios:

  • With workload-aware preemption, it's not a problem because, if possible, pods will be scheduled without preemption first.
  • Without workload-aware preemption, if a pod group is not feasible and a pod requires preemption, the whole pod group requires preemption, as the latter pods may require the preemptor to be scheduled or use the freed-up space.
• When a pod group is feasible but some additional pods require preemption, we could consider binding the previous pods and putting the rest back in the queue (obviously trying to schedule them first), for the same reason as in the point above.

Therefore, we could only optimize the third scenario, which would be a temporary solution anyway, superseded by workload-aware preemption.

Member

OK, I have some other thoughts around preemption, but apparently we should discuss them on the workload-aware preemption implementation KEP, as the current code is kinda temporary and full of 'will be changed after workload-aware preemption'.

}
go sched.runBindingCycle(ctx, podCtx.state, schedFwk, podResult.scheduleResult, assumedPodInfo, podSchedulingStart, podCtx.podsToActivate)
scheduledPods++
case podGroupUnschedulable:
Member

I am confused: can this happen? I.e., this pod is schedulable but the whole group is unschedulable.
Looking at podGroupSchedulingDefaultAlgorithm, I thought that if a pod is schedulable, it changes the group status to feasible or waiting on preemption.

Member Author

Yes, it can happen when the count of schedulable pods is less than minCount. Note that here we check podResult.status, which is Success, not podResult.permitStatus, which is Wait.

Member

Note that here we check podResult.status, which is Success, not podResult.permitStatus, which is Wait.

Nice guess at how I misread it and got confused.
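To make the status-vs-permitStatus distinction concrete, a hypothetical illustration (all names assumed, not the merged code):

// Each pod that finds a fit ends with status Success while its permitStatus
// is Wait; the group is unschedulable only when too few pods fit.
schedulable := 0
for _, r := range podResults {
    if r.status.IsSuccess() {
        schedulable++
    }
}
if schedulable < int(minCount) {
    groupStatus = podGroupUnschedulable // even though some pods succeeded individually
}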

// running the remaining plugins and returns an error. If any of the
// plugins returns "Wait", this function will NOT create a waiting pod object,
// but just return status with "Wait" code. It's caller's responsibility to act on that code.
func (f *frameworkImpl) RunPermitPluginsWithoutWaiting(ctx context.Context, state fwk.CycleState, pod *v1.Pod, nodeName string) (status *fwk.Status) {
Member

+1 to @dom4ha, I don't like this code. We should have only one version: either we somehow use the waiting pod for the pod group too, or we always create the waiting pod outside of this function.

@macsko
Member Author

macsko commented Feb 23, 2026

I'm seeing regular panics / integration test failures on TestPodGroupScheduling since this merged

xref https://storage.googleapis.com/k8s-triage/index.html?text=TestPodGroupScheduling

Fix got merged today: #137194


// ForgetPod forgets a given pod from the snapshot.
// This function is not thread safe, so it should be executed when no other routines can write/read from the snapshot.
func (s *Snapshot) ForgetPod(logger klog.Logger, pod *v1.Pod) error {
Member

Is it efficient to forget all pods one by one with this func? Can we keep the old nodeInfoMap at the beginning of the workload sched cycle, and revert the cache by restoring it afterwards?

Member Author

I think cloning the entire nodeInfoMap could be much more time-consuming in large clusters. Given that for TAS we would need to revert it per placement, it would be even worse. If the current forgetting logic proves to be inefficient, we can consider a delta storage or something else for the snapshot.

Member

We will have to handle this problem differently once we decide to perform TAS or WAS preemption scheduling in parallel. We can address it at that time, but for now this sounds sufficient.

Member

We will have to handle this problem differently once we decide to perform TAS or WAS preemption scheduling in parallel.

OK, that convinced me. Let's not worry about my point for now.
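For contrast, the clone-and-restore alternative floated above would look roughly like this (hypothetical sketch with an assumed deep-copy helper; as discussed, likely more expensive than per-pod forgetting in large clusters):

// Save a deep copy of the node map before the pod group cycle...
saved := make(map[string]*NodeInfo, len(s.nodeInfoMap))
for name, ni := range s.nodeInfoMap {
    saved[name] = ni.Clone() // assumed deep-copy helper; O(nodes) up front
}
// ...assume pods into the snapshot during the cycle, then revert everything
// at once by restoring the saved map afterwards.
s.nodeInfoMap = saved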

utilruntime.HandleErrorWithLogger(logger, err, "Failed to forget assumed pod")
}
}
utilruntime.HandleErrorWithLogger(logger, nil, "Found assumed pods in the snapshot that were not forgotten", "assumedPodsCount", len(s.assumedPods))
Member

@sanposhiho sanposhiho Feb 23, 2026

nit: Does it make sense to use HandleErrorWithLogger? That adds unnecessary backoff wait time. Though I know this error message ideally shouldn't ship at all in prod k8s, the wait time would just slow down scheduling with no benefit.

Member Author

Changed to logger.Error in #137216
