KEP-4671: Introduce Workload Scheduling Cycle #136618
k8s-ci-robot merged 2 commits into kubernetes:master
Conversation
Skipping CI for Draft Pull Request.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: macsko. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
/hold
/cc
Force-pushed 9c3e09f to b624834 (Compare)
/test all
Force-pushed b624834 to 83c5c68 (Compare)
/test all
Overall the PR looks good to me. Considering that many PRs are blocked on this one and how hard it is to keep rebasing them, let's treat my comments as optional ones that can be applied in follow-up PRs (we need them anyway). I still need to finish the tests review today before the final LGTM. Regarding keeping the cycle state in
dom4ha
left a comment
I finished the tests review and, besides some small comments, it's still LGTM.
```go
	expectedUsedPVCSet: sets.New("test-ns/test-pvc1", "test-ns/test-pvc2"),
},
{
	name: "Assume and forget in cache, and in snapshot",
```
Can we have an additional test which assumes in the snapshot pods that include PVCs and PodAffinity? It could be the same as this one, but with pods that have a PVC and affinity.
```go
	createPods: []*v1.Pod{p1, otherP1, p2, otherP2, p3, otherP3},
},
{
	name: "Verify the entire gang is now scheduled",
```
IIUC this test is deterministic because of the order in which pods are created, and the timestamp recorded in the active queue has sufficient precision (we never have two pods with the same timestamp).
Can we have exactly the same test but with the order of pod creation reversed, to prove there is no other constraint driving the scheduler's decision?
> IIUC this test is deterministic because of the order in which pods are created, and the timestamp recorded in the active queue has sufficient precision (we never have two pods with the same timestamp).

Correct, the timestamp is the main factor. Note that such a test would be flaky in a real cluster because the order in which the scheduler processes the pods does not have to reflect the order in which the pods are created (in the scheduling queue, the timestamp is the time when the pod entered the queue). In integration tests, we can safely assume that the event handlers will observe the pods in the same order that they were created.

> Can we have exactly the same test but with the order of pod creation reversed, to prove there is no other constraint driving the scheduler's decision?

Good idea, added.
```go
// This function is not thread safe, so it should be executed when no other routines can write/read from the snapshot.
func (s *Snapshot) forgetAllAssumedPods(logger klog.Logger) {
	for _, pod := range s.assumedPods {
		err := s.ForgetPod(logger, pod)
```
Since this method is a safety check only, can you log an error here whenever there was any pod to forget?
```go
		metrics.PodGroupUnschedulable(schedFwk.ProfileName(), metrics.SinceInSeconds(start))
	case podGroupWaitingOnPreemption:
		logger.V(2).Info("Pod group is waiting for preemption", "podGroup", klog.KObj(podGroupInfo), "unschedulablePods", unschedulablePods)
		metrics.PodGroupUnschedulable(schedFwk.ProfileName(), metrics.SinceInSeconds(start))
```
Why not report the metric under "waiting_on_preemption"?

For pods we only return unschedulable, even if preemption was initiated. I wanted to clone the same states, but maybe having a separate one for waiting on preemption is a good idea. Added.
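For reference, the change described here ("Added") would look roughly like the fragment below; PodGroupWaitingOnPreemption is a hypothetical metrics helper mirroring the existing PodGroupUnschedulable one, not necessarily the name used in the follow-up:

```go
	case podGroupWaitingOnPreemption:
		logger.V(2).Info("Pod group is waiting for preemption", "podGroup", klog.KObj(podGroupInfo), "unschedulablePods", unschedulablePods)
		// Record the group under a dedicated "waiting_on_preemption" status
		// instead of reusing the "unschedulable" one (hypothetical helper).
		metrics.PodGroupWaitingOnPreemption(schedFwk.ProfileName(), metrics.SinceInSeconds(start))
```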
- Add integration tests for gang and basic policy workload scheduling
- Add more tests for cluster snapshot
- Proceed to binding cycle just after pod group cycle
- Enforce one scheduler name per pod group, rename workload cycle to pod group cycle
- Add unit tests for pod group scheduling cycle
- Run ScheduleOne tests treating pod as part of a pod group
- Rename NeedsPodGroupCycle to NeedsPodGroupScheduling
- Observe correct per-pod and per-podgroup metrics during pod group cycle
- Rename pod group algorithm status to waiting_on_preemption
- Mention forgetAllAssumedPods is a safety check
Force-pushed 61fcdc7 to 6233b25 (Compare)
/lgtm
Looks great!
LGTM label has been added.
Git tree hash: 0be4b06f2df66360ad2e3c3b84db0a0f9e00e7bd
/assign
I'm seeing regular panics / integration test failures on TestPodGroupScheduling since this merged.
examples at
sanposhiho
left a comment
Great work crafting a new fundamental layer.
I left comments; please open a follow-up PR as we discussed.
```go
@@ -0,0 +1,466 @@
/*
Copyright The Kubernetes Authors.
```
Copyright headers should no longer include the year; see kubernetes/hack/boilerplate/boilerplate.py, line 191 at 0cf70d1.
```go
		sched.SchedulingQueue.Done(podInfo.Pod.UID)
		return
	}
	sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
```
Is it desirable to requeue with an Error status?

Feeling like we should requeue it somehow so it gets requeued on a Pod update event. Or maybe in PreEnqueue.

Ideally, such pod groups would be blocked on a PodGroup-level PreEnqueue. This could be done partially when PodGroup queueing is implemented, so I will defer to that time. For now, I think having an error is more or less okay.
```go
		sched.SchedulingQueue.Done(podInfo.Pod.UID)
		return
	}
	sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
```
Bug: err is always nil here. I guess you meant to use the err from L48.
```go
podGroupInfo, err := sched.podGroupInfoForPod(ctx, podInfo)
if err != nil {
	podFwk, err := sched.frameworkForPod(podInfo.Pod)
	if err != nil {
		// This shouldn't happen, because we only accept for scheduling the pods
		// which specify a scheduler name that matches one of the profiles.
		klog.FromContext(ctx).Error(err, "Error occurred")
		sched.SchedulingQueue.Done(podInfo.Pod.UID)
		return
	}
	sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
	return
}
```
Why do it twice (here and in scheduleOnePodGroup)?

Do you mean the part with frameworkForPod and FailureHandler? It's there to correctly return the failed pod to the queue. Here it's called when podGroupInfoForPod fails (this is a temporary function until we have pod group queueing). In podGroupInfoForPod it's called when frameworkForPodGroup fails (which should also be optimized with pod group queueing). So I think both are used here only temporarily.

nvm, I misread the code; I thought we had two duplicated things doing basically the same thing.
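For the err-shadowing bug flagged above, a minimal sketch of the fix, assuming the intent is to pass the outer err from podGroupInfoForPod to the FailureHandler (the fwkErr rename is a hypothetical name that avoids the shadowing):

```go
podGroupInfo, err := sched.podGroupInfoForPod(ctx, podInfo)
if err != nil {
	// Name the framework-lookup error differently so the outer err is
	// still visible when building the failure status below.
	podFwk, fwkErr := sched.frameworkForPod(podInfo.Pod)
	if fwkErr != nil {
		klog.FromContext(ctx).Error(fwkErr, "Error occurred")
		sched.SchedulingQueue.Done(podInfo.Pod.UID)
		return
	}
	sched.FailureHandler(ctx, podFwk, podInfo, fwk.AsStatus(err), nil, time.Now())
	return
}
```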
```go
// Synchronously attempt to find a fit for the pod group.
start := time.Now()
```
Why not generate start within podGroupCycle? Why does podGroupCycle have to get this as an argument?
```go
} else {
	pInfo = p.unschedulablePods.get(pod)
	if pInfo != nil {
		if pInfo.Gated() {
```
Wondering, should we run PreEnqueue here? I am kinda concerned about a race condition here, like:
- The last pod of the gang is added, and this pod is added to activeQ straightaway because it is the last pod.
- The last pod is popped immediately.
- The other pods are not ungated yet (the pop was faster than the queue handling the pod add event and triggering the QHint/PreEnqueue for the other pods).
- Because the other pods are not ungated, the workload scheduling cycle can only handle the last pod that was just added to the scheduler.

PreEnqueue is supposed to be very lightweight, so I don't have a perf concern about running it again when the pod is gated.
This is a temporary function that should be replaced soon by the proper pod group queueing, so this concern shouldn't reach the 1.36 release. With pod group queueing, the pod group will be queueable as long as the PreEnqueue check passes for all of its member pods.
```go
	}
	podGroupInfo.QueuedPodInfos = append(podGroupInfo.QueuedPodInfos, unscheduledPodInfo)
}
// Sort the pods in deterministic order. First by priority, then by their InitialAttemptTimestamp.
```
Should we actually add some randomness spice, for heterogeneous workloads?

According to the KEP discussion, we want to make the algorithm deterministic for now, and heterogeneous workloads won't be well supported by the scheduler anyway. We can consider introducing non-determinism, but likely not in v1.36.
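As a self-contained illustration of the deterministic ordering described in the diff comment above (simplified stand-in types, not the scheduler's actual QueuedPodInfo):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// queuedPod is a simplified stand-in for the scheduler's queued pod info.
type queuedPod struct {
	name                    string
	priority                int32
	initialAttemptTimestamp time.Time
}

func main() {
	now := time.Now()
	pods := []queuedPod{
		{"low-late", 0, now.Add(2 * time.Second)},
		{"high", 10, now.Add(3 * time.Second)},
		{"low-early", 0, now.Add(1 * time.Second)},
	}
	// First by priority (higher first), then by InitialAttemptTimestamp
	// (earlier first), mirroring the comment in the diff.
	sort.Slice(pods, func(i, j int) bool {
		if pods[i].priority != pods[j].priority {
			return pods[i].priority > pods[j].priority
		}
		return pods[i].initialAttemptTimestamp.Before(pods[j].initialAttemptTimestamp)
	})
	for _, p := range pods {
		fmt.Println(p.name) // high, low-early, low-late
	}
}
```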
```go
if podResult.permitStatus.IsSuccess() {
	// When the permit returns success for any pod, the pod group is schedulable.
	if requiresPreemption {
		// If any preemption is required, the whole pod group requires it to be feasible.
```
Why? That depends on how many pods need to be schedulable (gang), right? Or, if it is basic, we actually do not need to require preemption in any case.
Good point. After thinking about it, I came up with a few scenarios:
- With workload-aware preemption, it's not a problem because, if possible, pods will be scheduled without preemption first.
- Without workload-aware preemption, if a pod group is not feasible and a pod requires preemption, the whole pod group requires preemption, as the latter pods may require the preemptor to be scheduled or may use the freed-up space.
- When a pod group is feasible but some additional pods require preemption, we could consider binding the previous pods and putting the rest back in the queue (obviously trying to schedule them first), for the same reason as in the point above.

Therefore, we could only optimize the third scenario, which would be a temporary solution anyway, overwritten by workload-aware preemption.
Ok, I have some other thoughts around preemption, but apparently we should discuss them in the workload-aware preemption implementation KEP, as the current code is kinda temporary and full of 'will be changed after workload-aware preemption'.
```go
		}
		go sched.runBindingCycle(ctx, podCtx.state, schedFwk, podResult.scheduleResult, assumedPodInfo, podSchedulingStart, podCtx.podsToActivate)
		scheduledPods++
	case podGroupUnschedulable:
```
I am confused; can this happen? I.e., this pod is schedulable but the whole group is unschedulable. Looking at podGroupSchedulingDefaultAlgorithm, I thought that if a pod is schedulable, it changes the group status to feasible or waiting on preemption.

Yes, it can happen when the schedulable pods count is less than minCount. Note that we check here the podResult.status, which is Success, not podResult.permitStatus, which is Wait.

> Note that we check here the podResult.status, which is Success, not podResult.permitStatus, which is Wait.

Nice guess on how I misread and got confused.
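To make the minCount interaction concrete, here is a hedged sketch of the decision (hypothetical names and status strings; the actual logic lives in podGroupSchedulingDefaultAlgorithm and is more involved):

```go
// groupStatus shows how an individual pod can be schedulable (its
// podResult.status is Success) while the group as a whole ends up
// unschedulable: a gang needs at least minCount schedulable pods.
func groupStatus(schedulablePods, minCount int, anyRequiresPreemption bool) string {
	if schedulablePods >= minCount {
		return "feasible"
	}
	if anyRequiresPreemption {
		// Per the earlier preemption discussion: if the group is not yet
		// feasible and any pod needed preemption, the whole group waits
		// on that preemption.
		return "waiting_on_preemption"
	}
	return "unschedulable"
}
```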
```go
// running the remaining plugins and returns an error. If any of the
// plugins returns "Wait", this function will NOT create a waiting pod object,
// but just return status with "Wait" code. It's caller's responsibility to act on that code.
func (f *frameworkImpl) RunPermitPluginsWithoutWaiting(ctx context.Context, state fwk.CycleState, pod *v1.Pod, nodeName string) (status *fwk.Status) {
```
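For context, the caller is expected to act on the returned "Wait" code itself; a hedged sketch of such a call site (the handling shown is hypothetical, not the PR's actual caller):

```go
status := f.RunPermitPluginsWithoutWaiting(ctx, state, pod, nodeName)
switch {
case status.IsSuccess():
	// All permit plugins approved; the pod can proceed toward binding.
case status.Code() == fwk.Wait:
	// No waiting-pod object was created; the pod group cycle must track
	// the wait itself and decide later whether the group can proceed.
default:
	// Rejected or errored; treat the pod as unschedulable/failed.
}
```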
+1 to @dom4ha, I don't like this code. We should have one version only: either we somehow use the waiting pod for pod groups too, or we always create the waiting pod outside of this function.

The fix got merged today: #137194
```go
// ForgetPod forgets a given pod from the snapshot.
// This function is not thread safe, so it should be executed when no other routines can write/read from the snapshot.
func (s *Snapshot) ForgetPod(logger klog.Logger, pod *v1.Pod) error {
```
Is it efficient to forget all pods one by one with this func? Can we keep the old nodeInfoMap at the beginning of the workload scheduling cycle and revert the cache by restoring it afterwards?

I think cloning the entire nodeInfoMap can be much more time-consuming in large clusters. Given that for TAS we would need to revert it per placement, it would be even worse. If the current forgetting logic proves to be inefficient, we can consider having a delta storage or something else for the snapshot.

We will have to handle this problem differently once we decide to perform TAS or WAS preemption scheduling in parallel. We will have to address the problem at that time, but for now it sounds sufficient.

> We will have to handle this problem differently once we decide to perform TAS or WAS preemption scheduling in parallel.

Ok, that convinced me; let's not worry about my point now.
```go
			utilruntime.HandleErrorWithLogger(logger, err, "Failed to forget assumed pod")
		}
	}
	utilruntime.HandleErrorWithLogger(logger, nil, "Found assumed pods in the snapshot that were not forgotten", "assumedPodsCount", len(s.assumedPods))
```
Nit: does it make sense to use HandleErrorWithLogger? That adds unnecessary backoff wait time. Though I know ideally this error message isn't shipped at all in a prod k8s, the wait time would just slow down scheduling with no benefit.
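A sketch of the suggested alternative, assuming plain logging is acceptable for this safety check (logr's Logger.Error accepts a nil error):

```go
if len(s.assumedPods) > 0 {
	// Log directly rather than via utilruntime.HandleErrorWithLogger,
	// which adds a backoff sleep that would only slow scheduling down here.
	logger.Error(nil, "Found assumed pods in the snapshot that were not forgotten", "assumedPodsCount", len(s.assumedPods))
}
```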
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds a pod group scheduling cycle to the scheduler's main loop flow.
TBD in the next PR: add information to the pod status when a pod uses inter-pod dependencies or the group is non-homogeneous.
Which issue(s) this PR is related to:
Special notes for your reviewer:
This PR doesn't introduce workload-awareness to the scheduling queue; another PR will be responsible for that. In this PR, the workload cycle is initiated when any pod from a pod group is popped from the queue and hasn't yet been processed by this cycle.
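A minimal sketch of that flow, with hypothetical wiring (NeedsPodGroupScheduling comes from the commit list above; the surrounding function shape is illustrative only, not the PR's actual scheduleOne signature):

```go
func (sched *Scheduler) scheduleOne(ctx context.Context) {
	podInfo := sched.NextPod()
	// If the popped pod belongs to a pod group that hasn't been handled by
	// the pod group cycle yet, schedule the whole group at once.
	if podInfo.NeedsPodGroupScheduling() {
		sched.scheduleOnePodGroup(ctx, podInfo)
		return
	}
	// Otherwise fall back to the regular per-pod scheduling cycle.
	sched.schedulePod(ctx, podInfo)
}
```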
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: