
KEP-5501: Reflect PreEnqueue rejections in Pod status #5510

Merged
k8s-ci-robot merged 10 commits into kubernetes:master from macsko:kep_5501_reflect_preenqueue_rejections_in_pod_status
Oct 15, 2025

Conversation

@macsko
Member

@macsko macsko commented Sep 1, 2025

  • One-line PR description: Add KEP-5501
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 1, 2025
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Scheduling Sep 1, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 1, 2025
@macsko
Member Author

macsko commented Sep 1, 2025

/assign @dom4ha @sanposhiho
/cc @ania-borowiec

@macsko
Member Author

macsko commented Sep 1, 2025

/hold
To make sure it's approved by SIG Scheduling leads

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 1, 2025
@macsko macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from e4698c8 to c6ddb75 on September 1, 2025 13:10
@macsko macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from c6ddb75 to 63fca5d on September 2, 2025 09:15
@macsko macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from 63fca5d to 4e3d0ab on September 2, 2025 09:15
Contributor

@ania-borowiec ania-borowiec left a comment

Looks good!

- Gives plugin developers the flexibility to decide if a status update is valuable for their specific logic.

- **Cons:**
- Could lead to perceived inconsistency, where some `PreEnqueue` rejections appear on the Pod status
Contributor

Yes, it could be an inconsistency, but is there really any harm in that? If there is a need to display a human-readable message to the user, then an empty rejection message can be interpreted as some default message (e.g. saying something about possible transient errors or other potential vague reasons).

Member

Is this point about not reporting the fact that a pod waits, or just about skipping a custom message?

Member Author

Is this point about not reporting the fact that a pod waits, or just about skipping a custom message?

Yes. See the DRA case (kubernetes/kubernetes#129698 (comment)) and the explanation in the How should plugins provide the status message? section:

This flexibility is important for plugins that wish to avoid reporting transient rejections.
For example, the DynamicResources plugin might observe a rejection because the scheduler processes a Pod faster
than a ResourceClaim becomes visible through the watcher (see a comment).
In such cases where the condition is expected to resolve in seconds, populating the status would be inappropriate noise.
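For illustration, a minimal Go sketch of this opt-out idea (all types, fields, and function names here are hypothetical stand-ins, not the merged API):

```go
package main

import "fmt"

// Status is a simplified stand-in for the framework's status type.
type Status struct{ Code, Message string }

// PreEnqueueResult is hypothetical: an empty Message means
// "reject the pod, but don't surface anything in the Pod status".
type PreEnqueueResult struct {
	Message string
}

// preEnqueue mimics a DRA-style check: report durable problems (the claim
// does not exist) but stay silent on transient ones (the claim has not
// reached the informer cache yet and should appear within seconds).
func preEnqueue(claimExists, claimInCache bool) (*PreEnqueueResult, *Status) {
	switch {
	case !claimExists:
		return &PreEnqueueResult{Message: "waiting for ResourceClaim to be created"},
			&Status{Code: "Unschedulable", Message: "ResourceClaim not found"}
	case !claimInCache:
		return &PreEnqueueResult{}, &Status{Code: "Unschedulable", Message: "ResourceClaim not yet in cache"}
	}
	return nil, nil // admitted to the scheduling queue
}

func main() {
	res, st := preEnqueue(true, false)
	fmt.Printf("status=%q userMessage=%q\n", st.Message, res.Message) // silent transient rejection
}
```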

Comment thread keps/sig-scheduling/5501-reflect-preenqueue-rejections-in-pod-status/README.md Outdated

- **Alternatives Considered:**
- Actively clearing the message: The scheduler could clear the condition if,
on a subsequent check, the original rejecting plugin no longer rejects the pod.
Contributor

Would simply checking the diff between states "before" and "now" be enough?

Member Author

WDYM?

Contributor

You said the condition would be cleared if the original rejecting plugin no longer rejects the pod.
I am suggesting that it may be simpler than that: updates should be sent whenever there is a difference between the conditions in the cache and the conditions after the current PreEnqueue run.

Member Author

updates should be sent whenever there is a difference between conditions in cache vs conditions after the current preenqueue run

But that looks like the current proposal, doesn't it? The only case where we wouldn't do this is the transition from Plugin A (which reported a message) to Plugin B (which didn't report).

Contributor

Yes, that's my point. My thinking is that it's better to display some generic default message for failing Plugin B, than to keep displaying a no-longer-correct message saying that Plugin A failed.

- Actively clearing the message: The scheduler could clear the condition if,
on a subsequent check, the original rejecting plugin no longer rejects the pod.
This would provide a more accurate real-time status but could cause confusion
if the condition appears and disappears while the Pod remains `Pending` for another, unreported reason.
Contributor

Maybe displaying a default message to the user in case of an empty status (as mentioned in my comment above) would help solve this?
If our goal is to make the messaging meaningful and informative to users, then an explanatory default message could possibly help make things clearer for them?

Member Author

I'm not sure such a default message would be that informative. It won't answer clearly why the pod is on hold. I believe the stale message might be better than sending such a placeholder. And it would reduce the number of API calls sent.

Contributor

@ania-borowiec ania-borowiec Sep 5, 2025

As I wrote above - my personal opinion is that being not-very-informative is better than being misleading.
I might be wrong, but I assume that if a user looks at the status message, they are inclined to investigate the reason behind the pending state, e.g. check if some resource is indeed not ready. And if they see that the resource is ready and the pod still keeps waiting for it, they might consider this a bug in k8s and report it, causing more triage work and a poor user experience.

Member Author

Please note that in the case of the Unschedulable reason, when the pod is rejected by filters, we don't remove this message. It is displayed in the Pod status until the pod is retried. I'm not sure we can treat a similar mechanism as misleading in the case of PreEnqueue.

Contributor

That is a fair point, but the Unschedulable reason reflects that the pod is in the unschedulable queue (or pending a scheduling cycle) and the scheduler does not yet know if the pod will become schedulable.
In the case of PreEnqueue, the scheduler does know that the specific PreEnqueue plugin is no longer failing, but in the proposed scenario the scheduler doesn't surface this knowledge to the user.

Member Author

That's right, but from the user's POV, the Unschedulable status may be outdated, especially when the scheduling queue contains plenty of pods. So, I expect users to be aware that the longer the status remains unchanged, the more likely it is to be outdated.

Obviously, the NotReadyForScheduling reason could have its own semantics. I'm just still not sure what's the best choice here.

Contributor

Is there any sort of an "average scenario" where we could measure how often pods actually fail scheduling in preenqueue? Or would that look very different for various users?

Member Author

It depends on the scenario, but:

  • For SchedulingGates, PreEnqueue fails until all the gates are removed (so could be multiple times per pod)
  • For DefaultPreemption, PreEnqueue fails until the API calls for all preemption's victims are made (this shouldn't take really long, so maybe 1 PreEnqueue failure per pod is a good assumption)
  • For DRA, PreEnqueue fails until all the ResourceClaims are available. Usually, there is one failure per pod, unless someone forgets to create the ResourceClaim 😃

The integration tests listed above should cover all the scenarios,
so implementing e2e tests is a redundant effort.

### Graduation Criteria
Contributor

Actually what is the intended behavior when the feature is disabled? Using the new API for PreEnqueue, but not filling the new optional field?

Member Author

The PreEnqueue plugins would use the new interface, but the reported message would be ignored, i.e. no API call would be dispatched.
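For illustration, a minimal sketch of that behavior (function and parameter names are hypothetical):

```go
package main

import "fmt"

// Plugins always run the new PreEnqueue interface; the feature gate only
// decides whether the returned message turns into a pods/status PATCH.
func maybePatchPodStatus(gateEnabled bool, message string, dispatch func(string)) {
	if !gateEnabled || message == "" {
		return // message dropped; no API call enqueued
	}
	dispatch(message)
}

func main() {
	patch := func(m string) { fmt.Println("PATCH pods/status:", m) }
	maybePatchPodStatus(false, "blocked by plugin X", patch) // gate off: nothing happens
	maybePatchPodStatus(true, "blocked by plugin X", patch)  // gate on: PATCH enqueued
}
```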


During the beta period, the feature gate `SchedulerPreEnqueuePodStatus` is enabled by default,
so users don't need to opt in. This is a purely in-memory feature for the kube-scheduler,
so no special actions are required outside the scheduler.
Contributor

Except that out-of-tree plugin code needs to be updated to use the new API.
Unless we don't consider them "outside of scheduler"?

Member Author

Here, in the upgrade strategy, we assume that the scheduler is at the new version, so its plugins use the updated PreEnqueue interface. Enabling/disabling the feature gate doesn't impact the plugins, as they are already migrated.

Member

@dom4ha dom4ha left a comment

Very well written and a very valuable feature. I wonder how Workload Aware Scheduling may change the approach, if we could write a workload status instead for repeating pods. Maybe we could even write the status to hypothetical PodGroups?

Currently, when a `PreEnqueue` plugin rejects a Pod, this decision is not communicated back to the user
via the Pod's status. The Pod simply remains `Pending`, leaving the user to manually inspect scheduler logs
or other components to diagnose the issue. This lack of feedback is particularly problematic for plugins
that operate transparently to the user, as the reason for the delay is completely hidden.
Member

What does it mean exactly?

Member Author

@macsko macsko Sep 4, 2025

Do you have any specific sentence in mind? In general, this paragraph describes the current behavior, where we don't communicate the PreEnqueue rejection to the users.

Only for Scheduling Gates do we do that, through the kube-apiserver - it knows that the pod would be rejected because it can infer that from the pod spec.

Maybe I should rephrase the sentence about operating transparently. I meant that other plugins, like DRA, reject the Pod based on more indirect rules, which can't be easily inferred from the Pod.

Member

ah, right, I highlighted "plugins that operate transparently to the user", but GitHub does not reflect that.

maybe rephrase it?

Comment thread keps/sig-scheduling/5501-reflect-preenqueue-rejections-in-pod-status/README.md Outdated
(e.g., by returning an empty message described above).

- **Pros:**
- Reduces API server load and potential "noise" for plugins that reject pods for very short,
Member

Can't we mitigate this problem by introducing some sort of delay or rate limiting (for the number of PreEnqueue updates)? Can you document such a possibility with your assessment of whether it does or does not make sense?

Member Author

I'll take that into consideration and mention it in the KEP. We already have some rate limiting, as we limit the number of dispatched goroutines in the API dispatcher. I'm not sure how a delay could help here.

Some more advanced mechanisms could be a part of the Asynchronous API calls extension.

Member

A delay can help avoid reporting things that have a follow-up call clearing the message, so that only prolonged states would be reflected.

Member Author

Introducing a delay shifts responsibility from the plugin to the framework. However, I'm not sure if that's really what we want. Why shouldn't the plugin, which has the most information about why it rejected the Pod, decide what to do with the message?

Even if we have such a delay mechanism, how should we set the delay so that it is correct? If it is too low, we may send irrelevant messages (as in the case of DRA or async preemption) and increase the load on the kube-apiserver. If it is too high, we will delay the status update, as a result reducing the possibility of debugging and the purpose for which we want to have this feature.

In addition, we must keep our current plugins in mind:

  • SchedulingGates — its status is reported by kube-apiserver. We could move it to kube-scheduler, but this would involve unnecessary API calls that can be avoided by keeping the current behavior in kube-apiserver. Also, keep in mind that this plugin has its own status reason (SchedulingGated), which would have to be differentiated somehow if we move it to kube-scheduler.
  • DefaultPreemption — I believe we should rely on the status returned by PostFilter, so it should not be overwritten with the less informative PreEnqueue message.
  • DynamicResources — as mentioned, they may want to decide what and when they want to report.

Member

I thought "delay" here just meant making api calls "asynchronously".

and,

Introducing a delay shifts responsibility from the plugin to the framework. However, I'm not sure if that's really what we want.

We do. How many API calls we make should be handled in a single place, which is the framework side. We shouldn't rely on plugins to not make too many API calls.

Why shouldn't the plugin, which has the most information about why it rejected the Pod, decide what to do with the message?

Plugins decide what to report in the messages, and the framework should decide whether it can ship these messages or not. Of course, plugins should make their messages simple so that they can help the framework to reduce the total number of API calls.
But, eventually the framework should decide whether the scheduler can deliver the update or not, based on how many API calls it has in the queue.

we may send irrelevant messages (as in the case of DRA or async preemption)

We should encourage the messages from PreEnqueue to be simple. Then, even if a message's content is not very important, like "The preemption is on-going and the pod is rejected", the update goes out only once and not more (the framework should ignore API calls that won't change anything).


Overall, I still don't understand why we need to make the reporting opt-in. Repeating the same comment here as well, though: I believe all plugins should report something to users on all rejections. Even if a rejection is supposed to be resolved soon-ish (like the async preemption), that isn't a valid reason to hide why a pod is being blocked.

p.s., The scheduler doesn't show any message from PostFilter, does it?

Member Author

I was thinking for some time about the most feasible approach (the Alternatives section helped me). I think the approach with delays ("Implicit + Delayed (No opt-out)") might be the best and simplest solution. So:

  • All plugins would provide a message via fwk.Status. There won't be an option to opt out.
  • Before sending the message we use the LastPreEnqueueRejectionMessage to verify the message is new. If it's new, we send the call to update.
  • We send the delayed async API call to patch the status with new message. The number of seconds could be just hardcoded.

For SchedulingGates, which reports the message via the kube-apiserver, we have two options:

  • Remove this behavior from the apiserver and keep the kube-scheduler only. However, we would need to send a correct reason for that plugin.
  • Keep the apiserver reporting path, but unify the message with the SchedulingGates plugin. This way, the LastPreEnqueueRejectionMessage check would skip the call if the status was already populated by the apiserver.
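For illustration, a minimal Go sketch of this flow (PodInfo.LastPreEnqueueRejectionMessage and the hardcoded delay follow the discussion above, but this is a sketch, not the merged implementation):

```go
package main

import (
	"fmt"
	"time"
)

const statusPatchDelay = 5 * time.Second // hardcoded, per the discussion

// PodInfo is a simplified stand-in for the queue's pod bookkeeping.
type PodInfo struct {
	LastPreEnqueueRejectionMessage string
}

// onPreEnqueueRejection dedups against the cached message and enqueues a
// delayed async PATCH; admitting the pod (or a newer message) would cancel
// a still-pending call.
func onPreEnqueueRejection(pi *PodInfo, msg string, enqueueDelayed func(string, time.Duration)) {
	if msg == pi.LastPreEnqueueRejectionMessage {
		return // same rejection as last time: skip the redundant API call
	}
	pi.LastPreEnqueueRejectionMessage = msg
	enqueueDelayed(msg, statusPatchDelay)
}

func main() {
	pi := &PodInfo{}
	enqueue := func(m string, d time.Duration) { fmt.Printf("patch %q after %v\n", m, d) }
	onPreEnqueueRejection(pi, "waiting for gang quorum", enqueue)
	onPreEnqueueRejection(pi, "waiting for gang quorum", enqueue) // deduped, no output
}
```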

Member Author

I'll update KEP when I have time

- Allows plugins to opt out of reporting by returning an empty message.

- **Cons:**
- Requires a breaking change to the `PreEnqueue` plugin interface.
Member

Can we use some standard message for beta to avoid introducing a breaking change? Providing info that a pod waits and which plugin blocks it should already improve the current state. Also, we could prevent plugin developers from creating non-parsable messages by blocking variable messages. We could reevaluate it before moving to GA.

Member Author

@macsko macsko Sep 4, 2025

We shouldn't introduce such changes just before moving to GA, but rather do so during the beta phase. Keep in mind that v1.35 will include the final changes related to moving interfaces to staging, so plugin developers will have to change their plugin interfaces anyway.

- Gives plugin developers the flexibility to decide if a status update is valuable for their specific logic.

- **Cons:**
- Could lead to perceived inconsistency, where some `PreEnqueue` rejections appear on the Pod status
Member

Is this point about not reporting the fact that a pod waits, or just about skipping a custom message?

- Provides a clear, logical progression for users to follow.

- **Cons:**
- A stale message could be displayed if a Pod is first rejected by a reporting plugin (Plugin A)
Member

Why can't the message be updated once that happens?

Member Author

The reason is in the Actively clearing the message alternative below:

This would provide a more accurate real-time status but could cause confusion
if the condition appears and disappears while the Pod remains Pending for another, unreported reason.

Moreover, we need to consider the number of API calls we would like to send. I believe that if we can omit some of them, we should do so to mitigate the performance impact of this feature.

Member

If we had a delay that addresses the problem of noise, we could require setting a status message for all plugins (even a default one, "Blocked by plugin X"). Of course, clearing would not have any delay and would cancel setting a message that's no longer relevant. Also, a new message would instantly replace the previous one, but would be delayed as well, so we increase the likelihood that it's also canceled (no unnecessary noise).

I barely remember, but we had some argument against delaying API calls; maybe it was only for setting nominations, which are time-critical. In this case it should be rather safe.

Contributor

+1 for default messages ("blocked by plugin X") and for clearing messages / updating them with another plugin's

Comment thread keps/sig-scheduling/5501-reflect-preenqueue-rejections-in-pod-status/README.md Outdated
Member

@sanposhiho sanposhiho left a comment

I'm glad we can finally start this work, thanks to the async API call feature :)
I just skimmed through, and left some initial comments.

Comment on lines +175 to +177
- Use the raw status message: This is simpler as it requires no interface changes.
However, these messages are often not written well for end-users.
It also makes it difficult for a plugin to conditionally opt out of reporting a status.
Member

Hmm, actually this alternative looks better to me because it would match the messages from other extension points: we're already using the messages from the statuses returned from Filter etc.

It also makes it difficult for a plugin to conditionally opt out of reporting a status.

Do we need to allow plugins not to report the messages? That would just hide the reason the pod is stuck from users.

Member

And, even if you disagree with me and we do want to implement such opt-out behavior, I'd like to just extend the semantics of framework.Status: e.g., if a framework.Status has code == Unschedulable with reasons == nil, then we regard it as the plugin wanting to reject this pod but not report anything to users.
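For illustration, a sketch of those proposed semantics (simplified stand-in types, not the framework's actual ones; this alternative is argued against below):

```go
package main

import "fmt"

// Code and Status are simplified stand-ins for the framework's types.
type Code int

const Unschedulable Code = 2

type Status struct {
	Code    Code
	Reasons []string
}

// wantsSilentRejection interprets "Unschedulable with no reasons" as
// "reject the pod, but report nothing to users".
func wantsSilentRejection(s *Status) bool {
	return s != nil && s.Code == Unschedulable && len(s.Reasons) == 0
}

func main() {
	fmt.Println(wantsSilentRejection(&Status{Code: Unschedulable}))                         // true: reject quietly
	fmt.Println(wantsSilentRejection(&Status{Code: Unschedulable, Reasons: []string{"x"}})) // false: report "x"
}
```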

Member Author

we're using the messages from the statuses returned from Filter etc already

I agree that it would be more consistent with the actual plugin's implementation, but if we want to have an opt-out behavior, I'd consider having a new field.

Do we need to allow plugins not to report the messages? That would just hide the pod stucking reason from users.

From the In-Tree Plugin Changes section:

  • SchedulingGates: Will always return an empty message, as its status will continue to be set by the kube-apiserver.
  • DefaultPreemption: Will always return an empty message, as its state is transient
    and the Unschedulable status from the PostFilter stage is more descriptive.
  • DynamicResources: Will be updated to return a descriptive message when a ResourceClaim is missing.
    It can also return a nil result to avoid reporting transient delays,
    for example when waiting for a ResourceClaim to become available in the scheduler's cache.

Allowing not to report the message can reduce the traffic to kube-apiserver, when such operation is irrelevant/redundant.

if framework.Status has code == Unschedulable with reasons == nil, then we regard plugins want to reject this pod, but not report anything to users.

I am not in favor of this approach. If we do this, it will affect logging (we would like to record why the plugin rejected the pod, but might not want to display that to the user immediately) and make it harder for plugin developers to manage - the interface would be indirect, so they would have to rely on and remember that an empty reason has its own behavior. Introducing a dedicated field in PreEnqueueResult would make it more explicit.

Member

I have the impression that all the reasons not to report come down to saving kube-apiserver traffic, except for "SchedulingGates", which I may not understand well.

Delaying messages can reduce the traffic generated by transient issues. I'm only worried that a delayed message may overwrite some other message set in the meantime by someone else. But any message change initiated by the kube-scheduler should cancel the previously pending (waiting) one and replace it with a new one or clear it.

Member

I see you updated the KEP with good arguments against the delay. Some alternatives are to:

  • make the delay optional (but I'm not sure it's worth implementing)
  • update the workload object status instead of the blocked pods (once the generic workload object is defined; it's in the design phase now). The lack of a delay still generates noise for transient issues, although on just one object.
  • detect that objects have been waiting on PreEnqueue for a longer period of time. It's like double-checking whether the condition still holds before applying a status. Still sounds complicated.

Member

From the In-Tree Plugin Changes section:
...
Allowing not to report the message can reduce the traffic to kube-apiserver, when such operation is irrelevant/redundant.

Hmm, even for the default preemption / dynamic resources, I feel like it's better to report something to indicate the pods are not retried on purpose, since today that is not visible at all. As Dominik mentioned, the delay from the async API call mechanism can mitigate the concern about too many apiserver calls. And if we implement this idea that you mentioned in another thread of mine, I don't think there would be too many API calls triggered by those PreEnqueue plugins (assuming the plugins keep their messages as simple and stable as possible at every rejection).

Member Author

@macsko macsko Sep 10, 2025

Good point, I will add a larger section about the delay, and we will be able to make a decision on that.

(e.g., by returning an empty message described above).

- **Pros:**
- Reduces API server load and potential "noise" for plugins that reject pods for very short,
Member

To mitigate it, can we just skip the error reporting if it's identical to the existing one?
e.g., if the pod keeps being rejected by the preemption PreEnqueue plugin, it will get a message like This pod is not schedulable because waiting for the preemption to be completed for the first time, but the preemption PreEnqueue plugin will (i.e. should) return the same common message, and then the scheduling framework doesn't update the pod with the message every time PreEnqueue rejects the same pod with the same reason.

Member Author

That is already explained in the Preventing redundant API calls section of the Design details:

To prevent API server flooding when a pod is rejected for the same reason repeatedly, message caching will be introduced.

  • A new field will be added to the PodInfo struct within the queue, for example, lastPreEnqueueRejectionMessage.
  • Before dispatching a status patch, the scheduler will compare the new rejection message
    with the cached lastPreEnqueueRejectionMessage.
  • The asynchronous call to the API server will only be dispatched if the status is new or different from the cached one. (...)

- Introduces a new value to the API, which must be documented and maintained.

- **Alternatives Considered:**
- PodReasonUnschedulable (`Unschedulable`): Reusing this reason would be somewhat incorrect,
Member

@sanposhiho sanposhiho Sep 6, 2025

I prefer not to add a new one, and prefer to use the existing condition. I feel like adding a new condition is rather confusing for users, especially those who are not very familiar with the scheduler.
Also, keeping the same condition would help other external components (e.g., CA), since they won't need to support a new condition?

as the Pod has not failed a scheduling attempt (i.e., it was never evaluated against nodes).

Today, it can happen that the scheduling cycle doesn't evaluate nodes but rejects a pod at PreFilter.

Member Author

Also, keep using the same condition would help other external components (e.g., CA) since they won't need to support a new condition

Or break them, if they rely on the fact the pod was processed and rejected by the filters.

I think I was wrong about the CA use case: the PreEnqueue rejection shouldn't be processed by the CA, because such rejections are not related to the nodes (capacity etc.), but to the pod itself. That can't be fixed by the CA's node provisioning. So, using the Unschedulable status would force the CA to process such impossible-to-fix pods or try to filter them out somehow.

Today, it could happen that the scheduling cycle doesn't evaluate nodes, but reject a pod at PreFilter.

And such behavior could be eventually migrated to the PreEnqueue.

I feel like adding a new condition is rather confusing for users, especially those who are not very familiar with the scheduler.

The PreEnqueue rejection is different from the Unschedulable rejection as it shows that the pod is missing something, so I think having the new reason is okay.

Member

I think I was wrong with the CA use case and the PreEnqueue rejection shouldn't be processed by the CA

This is a good point, and I'm convinced.

@helayoty helayoty moved this from Needs Triage to Needs Review in SIG Scheduling Sep 9, 2025
@macsko
Member Author

macsko commented Sep 10, 2025

I've updated the KEP with the alternatives mentioned in the comments.

However, it might create significant API churn for plugins that reject pods frequently and for short durations,
negatively impacting performance and UX.
- Mandatory reporting with a delay (cooldown period): This approach attempts to reduce API churn by waiting for a brief,
configurable delay before reporting a rejection. While better than immediate mandatory reporting, this has some flaws:
Contributor

who should configure the delay?

Member Author

That's a good question. We could have a few options:

  • Hardcoded delay per API call, e.g., delay only the PreEnqueue status patch call by a certain time
  • Configurable delay through scheduler's config - might be hard to implement clearly
  • Per-plugin delay, passed via PreEnqueueResult - this would move the responsibility to the plugin developers, who might not know what the correct delay is anyway

the scheduler could generate a generic message (e.g., "Pod is blocked on PreEnqueue by plugin: Plugin B").
This has two flaws:
- While it identifies the blocking plugin, it doesn't explain why the Pod was blocked, which is the essential information for debugging.
A generic message isn't a significant improvement over a stale (but potentially actionable) message.
Contributor

In my opinion stale != actionable 🙂 It was actionable, but now is no longer because it's stale.

Member Author

A stale message is still actionable (someone can try to fix the issue or just wait for it to resolve), compared to a useless (only slightly informative) message that is not actionable at all.

return
}
// Enqueue a PodStatusPatch call to clear the NotReadyForScheduling condition.
// Enqueue a *delayed* PodStatusPatch call to clear the NotReadyForScheduling condition.
Contributor

Actually... do we consider clearing the message without delay? Sending the "update pod" immediately, without waiting the 5 seconds?
Or do we still want to wait 5 sec before issuing it, in case something changes for this pod?

Member Author

Clearing the message is not that relevant, and I don't think we need to send it without a delay. When the pod passes PreEnqueue, it's likely it will soon be tried and either scheduled (bound) or stay unschedulable (with another status update).

with a sub-variant for the Implicit Interface model (with or without an opt-out).

The five models are:
1. **Explicit + Immediate (KEP Proposal):** A new `PreEnqueueResult` is introduced.
Contributor

The KEP mentions a 5-second delay; I think that makes "Explicit + Delayed" the KEP proposal model?

Contributor

Other than that - yes, we are on the same page. Thank you for putting it all together in a clear and comprehensible form!

Member Author

The KEP proposal is now the Implicit + Delayed (No opt-out) model. I'll update the Alternatives, because that section is outdated now.

Comment thread keps/prod-readiness/sig-scheduling/5501.yaml Outdated
@wojtek-t wojtek-t self-assigned this Oct 3, 2025
Member

@sanposhiho sanposhiho left a comment

Looks awesome. Very close to /approve for me.


2. The impact on scheduling latency and throughput under heavy load will be measured using performance tests.

3. The delaying mechanism is expected to reduce the load on kube-apiserver by canceling
Member

@sanposhiho sanposhiho Oct 6, 2025

Also, along with this, we will recommend users/maintainers make error messages as consistent as possible to reduce the number of necessary updates. e.g., instead of saying this pod is blocked because it's waiting for the preemption for pod1, pod2... to be completed, say this pod is blocked because it's waiting for the preemption for some victim pods to be completed. (The former has to be updated every time each pod deletion is done while the latter is consistent)

Member Author

Good point, updated

@ania-borowiec
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2025
Comment on lines +166 to +167
Since the `SchedulerAsyncAPICalls` feature was disabled by default in v1.34,
successfully enabling the `SchedulerPreEnqueuePodStatus` feature in v1.35 will depend on re-enabling the `SchedulerAsyncAPICalls` feature.
Member

@sanposhiho sanposhiho Oct 11, 2025

Right, so does it mean we start SchedulerPreEnqueuePodStatus as beta/enabled by default, but it's effectively disabled by default because SchedulerAsyncAPICalls is disabled by default? Or should we also disable SchedulerPreEnqueuePodStatus by default?

More generally, what if SchedulerPreEnqueuePodStatus is enabled while SchedulerAsyncAPICalls is disabled? Will we just implicitly disable SchedulerPreEnqueuePodStatus in that case? Or should we crash the scheduler at startup to surface the misconfiguration to users?

Member Author

@macsko macsko Oct 13, 2025

Feature gate dependencies can now be codified (see kubernetes/kubernetes#133697). Using that, enabling only SchedulerPreEnqueuePodStatus won't be possible.

If we are unable to re-enable async API calls in v1.35, the reflect PreEnqueue feature will remain disabled by default (in alpha or beta).
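For illustration, a hand-rolled sketch of the kind of startup validation discussed above (hypothetical; the real mechanism is the codified feature-gate dependencies from kubernetes/kubernetes#133697):

```go
package main

import "fmt"

// validateFeatureGates fails fast on the dependency between the two gates,
// so the misconfiguration is surfaced at startup instead of being silently
// ignored.
func validateFeatureGates(gates map[string]bool) error {
	if gates["SchedulerPreEnqueuePodStatus"] && !gates["SchedulerAsyncAPICalls"] {
		return fmt.Errorf("SchedulerPreEnqueuePodStatus requires SchedulerAsyncAPICalls to be enabled")
	}
	return nil
}

func main() {
	err := validateFeatureGates(map[string]bool{
		"SchedulerPreEnqueuePodStatus": true,
		"SchedulerAsyncAPICalls":       false,
	})
	fmt.Println(err) // misconfiguration reported at startup
}
```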

Member

If we are unable to re-enable async API calls in v1.35, the reflect PreEnqueue feature will remain disabled by default (in alpha or beta).

Ok - can you update the explanation a bit to clarify this in the beta requirement section? like

- Implement the feature behind a feature gate (`SchedulerPreEnqueuePodStatus`), enabled by default.
+ Implement the feature behind a feature gate (`SchedulerPreEnqueuePodStatus`). We might start this gate disabled by default, depending on whether we reenable the SchedulerAsyncAPICalls gate or not.

Member Author

Added

Member

@wojtek-t wojtek-t left a comment

A few comments from the PRR perspective - once addressed, this looks reasonable to me.


As a data scientist running a distributed training job, I submit a batch of Pods that must be scheduled together (a "gang").
The custom gang scheduling logic, using a `PreEnqueue` plugin, blocks all these Pods from entering
the queue until there are enough resources for all of them to pass the scheduling.
Member

In the first implementation, those won't be blocked in PreEnqueue if resources are missing. The goal is to block them in PreEnqueue only until we see enough pods that the gang can, in theory, be scheduled. Determining whether there are enough resources for them will happen after PreEnqueue.

@dom4ha

Member

Correct, PreEnqueue waits until the workload object referred to by the pods appears and the pods reach quorum.

Member Author

Right, this part was outdated (written before the gang scheduling KEP proposal). I updated it now

The lack of a status condition is no worse than the current behavior.
Furthermore, any subsequent event that causes the Pod to be re-evaluated by the `PreEnqueue` plugins
will trigger a retry of the status patch. Exploring a more robust retry mechanism
within the asynchronous API calls feature itself would be a beneficial future enhancement.
Member

... but we don't consider it a blocker for this KEP.

Member Author

Right, added that

is left intact to avoid any performance regression in that scenario.

5. Future work: As the scheduler evolves, introducing batched status updates
could further mitigate the impact of many simultaneous rejections.
Member

Batching helps if we update the same pod multiple times with different conditions. I'm not sure this is really the primary concern.

Updating a large number of pods (e.g. large gangs) is imho the bigger concern, and batching will not help with that (we will not support cross-object batching at the API level). But OTOH we are marking those pods as pending anyway in that case - so if that happens once, it probably won't change anything...

Member Author

I meant some type of batching per group of pods. I changed that point to mention that we could have some per-workload status updates in the future instead.

3. If the checks pass, the scheduler proceeds:
- It immediately updates its internal state by setting `pInfo.LastPreEnqueueRejectionMessage` to the new message.
- It constructs the condition to patch and enqueues it with a delay into the asynchronous API dispatcher.
- An `Event` is emitted for the Pod with the reason `NotReadyForScheduling` and the rejection message.
Member

IIUC this is a regular k8s Event object, right? Do we batch those too somehow?
In other words, the exact same risks that we have for pods are true for Events too, and I don't see them being addressed.

Member Author

That's a good point. Indeed, sending an event requires making an API call, even if through a more optimized path. Anyway, I changed this part to emphasize that the event will be sent together with the API call to patch the Pod, so the risks should be mitigated.
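For illustration, a sketch of the condition built in step 3 above, assuming the rejection lands as a new reason on the PodScheduled condition (per the alternatives discussion; names follow the thread, not merged code):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildPreEnqueueCondition constructs the condition that would be enqueued
// as a delayed pods/status PATCH when a PreEnqueue plugin rejects the pod.
func buildPreEnqueueCondition(msg string) v1.PodCondition {
	return v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionFalse,
		Reason:             "NotReadyForScheduling",
		Message:            msg,
		LastTransitionTime: metav1.Now(),
	}
}

func main() {
	cond := buildPreEnqueueCondition("waiting for ResourceClaim to be created")
	fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
	// The Event (reason NotReadyForScheduling) is emitted together with
	// the patch call, per the comment above.
}
```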


Yes.

- API call type: `PATCH` on `pods/status`.
Member

What about new Events?

Member Author

Right, added

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 13, 2025
@macsko macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from a703582 to 6eea5f6 on October 13, 2025 09:46
@wojtek-t
Member

This looks good for PRR.

/approve PRR

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 13, 2025
@sanposhiho
Member

/approve
/hold in case @dom4ha wants to take a look. Feel free to unhold if not.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: macsko, sanposhiho, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dom4ha
Member

dom4ha commented Oct 14, 2025

Looks good, thanks @macsko
/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2025
@dom4ha
Copy link
Copy Markdown
Member

dom4ha commented Oct 15, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 15, 2025
@k8s-ci-robot k8s-ci-robot merged commit 7fb57da into kubernetes:master Oct 15, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.35 milestone Oct 15, 2025
@github-project-automation github-project-automation Bot moved this from Needs Review to Done in SIG Scheduling Oct 15, 2025