
KEP-5501: Reflect PreEnqueue rejections in Pod status #5510

Merged
k8s-ci-robot merged 10 commits into kubernetes:master from macsko:kep_5501_reflect_preenqueue_rejections_in_pod_status
Oct 15, 2025

Conversation

@macsko
Member

@macsko macsko commented Sep 1, 2025

  • One-line PR description: Add KEP-5501
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 1, 2025
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Scheduling Sep 1, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 1, 2025
@macsko
Member Author

macsko commented Sep 1, 2025

/assign @dom4ha @sanposhiho
/cc @ania-borowiec

@macsko
Member Author

macsko commented Sep 1, 2025

/hold
To make sure it's approved by SIG Scheduling leads

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 1, 2025
@macsko macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from e4698c8 to c6ddb75 on September 1, 2025 13:10
@macsko macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from c6ddb75 to 63fca5d on September 2, 2025 09:15
@macsko macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from 63fca5d to 4e3d0ab on September 2, 2025 09:15
Contributor

@ania-borowiec ania-borowiec left a comment

Looks good!

- Gives plugin developers the flexibility to decide if a status update is valuable for their specific logic.

- **Cons:**
- Could lead to perceived inconsistency, where some `PreEnqueue` rejections appear on the Pod status
Contributor

Yes, it could be an inconsistency, but is there really any harm in that? If there is a need to display a human-readable message to the user, then an empty rejection message can be interpreted as some default message (e.g. saying something about possible transient errors or other potential vague reasons).

Member

Is this point about not reporting the fact that a pod waits, or just about skipping a custom message?

Member Author

Is this point about not reporting the fact that a pod waits, or just about skipping a custom message?

Yes. See the DRA case (kubernetes/kubernetes#129698 (comment)) and the explanation in the How should plugins provide the status message? section:

This flexibility is important for plugins that wish to avoid reporting transient rejections.
For example, the DynamicResources plugin might observe a rejection because the scheduler processes a Pod faster
than a ResourceClaim becomes visible through the watcher (see a comment).
In such cases where the condition is expected to resolve in seconds, populating the status would be inappropriate noise.
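For illustration, a minimal Go sketch of this opt-out idea (all types, fields, and function names here are hypothetical stand-ins, not the merged API):

```go
package main

import "fmt"

// Status is a simplified stand-in for the framework's status type.
type Status struct{ Code, Message string }

// PreEnqueueResult is hypothetical: an empty Message means
// "reject the pod, but don't surface anything in the Pod status".
type PreEnqueueResult struct {
	Message string
}

// preEnqueue mimics a DRA-style check: report durable problems (the claim
// does not exist) but stay silent on transient ones (the claim has not
// reached the informer cache yet and should appear within seconds).
func preEnqueue(claimExists, claimInCache bool) (*PreEnqueueResult, *Status) {
	switch {
	case !claimExists:
		return &PreEnqueueResult{Message: "waiting for ResourceClaim to be created"},
			&Status{Code: "Unschedulable", Message: "ResourceClaim not found"}
	case !claimInCache:
		return &PreEnqueueResult{}, &Status{Code: "Unschedulable", Message: "ResourceClaim not yet in cache"}
	}
	return nil, nil // admitted to the scheduling queue
}

func main() {
	res, st := preEnqueue(true, false)
	fmt.Printf("status=%q userMessage=%q\n", st.Message, res.Message) // silent transient rejection
}
```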

Comment thread keps/sig-scheduling/5501-reflect-preenqueue-rejections-in-pod-status/README.md Outdated

- **Alternatives Considered:**
- Actively clearing the message: The scheduler could clear the condition if,
on a subsequent check, the original rejecting plugin no longer rejects the pod.
Contributor

Would simply checking the diff between states "before" and "now" be enough?

Member Author

WDYM?

Contributor

You said the condition would be cleared if the original rejecting plugin no longer rejects the pod.
I am suggesting that it may be simpler than that: updates should be sent whenever there is a difference between the conditions in the cache and the conditions after the current PreEnqueue run.

Member Author

updates should be sent whenever there is a difference between conditions in cache vs conditions after the current preenqueue run

But that looks like the current proposal, doesn't it? The only case where we wouldn't do this is the transition from Plugin A (which reported a message) to Plugin B (which didn't report).

Contributor

Yes, that's my point. My thinking is that it's better to display some generic default message for failing Plugin B, than to keep displaying a no-longer-correct message saying that Plugin A failed.

- Actively clearing the message: The scheduler could clear the condition if,
on a subsequent check, the original rejecting plugin no longer rejects the pod.
This would provide a more accurate real-time status but could cause confusion
if the condition appears and disappears while the Pod remains `Pending` for another, unreported reason.
Contributor

Maybe displaying a default message to the user in case of an empty status (as mentioned in my comment above) would help solve this?
If our goal is to make the messaging meaningful and informative to users, then an explanatory default message could possibly help make things clearer for them?

Member Author

I'm not sure such a default message would be that informative. It won't answer clearly why the pod is on hold. I believe the stale message might be better than sending such a placeholder. And it would reduce the number of API calls sent.

Contributor

@ania-borowiec ania-borowiec Sep 5, 2025

As I wrote above - my personal opinion is that being not-very-informative is better than being misleading.
I might be wrong, but I assume that if a user looks at the status message, they are inclined to investigate the reason behind the pending state, e.g. check if some resource is indeed not ready. And if they see that the resource is ready and the pod still keeps waiting for it, they might consider this a bug in k8s and report it, causing more triage work and a poor user experience.

Member Author

Please note that in the case of the Unschedulable reason, when the pod is rejected by filters, we don't remove this message. It is displayed in the Pod status until the pod is retried. I'm not sure we can treat a similar mechanism as misleading in the case of PreEnqueue.

Contributor

That is a fair point, but the Unschedulable reason reflects that the pod is in the unschedulable queue (or pending a scheduling cycle) and the scheduler does not yet know if the pod will become schedulable.
In the case of PreEnqueue, the scheduler does know that the specific PreEnqueue plugin is no longer failing, but in the proposed scenario the scheduler doesn't surface this knowledge to the user.

Member Author

That's right, but from the user's POV, the Unschedulable status may be outdated, especially when the scheduling queue contains plenty of pods. So, I expect users to be aware that the longer the status remains unchanged, the more likely it is to be outdated.

Obviously, the NotReadyForScheduling reason could have its own semantics. I'm just still not sure what's the best choice here.

Contributor

Is there any sort of an "average scenario" where we could measure how often pods actually fail scheduling in preenqueue? Or would that look very different for various users?

Member Author

It depends on the scenario, but:

  • For SchedulingGates, PreEnqueue fails until all the gates are removed (so could be multiple times per pod)
  • For DefaultPreemption, PreEnqueue fails until the API calls for all preemption's victims are made (this shouldn't take really long, so maybe 1 PreEnqueue failure per pod is a good assumption)
  • For DRA, PreEnqueue fails until all the ResourceClaims are available. Usually, there is one failure per pod, unless someone forgets to create the ResourceClaim 😃

The integration tests listed above should cover all the scenarios,
so implementing e2e tests is a redundant effort.

### Graduation Criteria
Contributor

Actually what is the intended behavior when the feature is disabled? Using the new API for PreEnqueue, but not filling the new optional field?

Member Author

The PreEnqueue plugins would use the new interface, but the reported message would be ignored, i.e. no API call would be dispatched.
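For illustration, a minimal sketch of that behavior (function and parameter names are hypothetical):

```go
package main

import "fmt"

// Plugins always run the new PreEnqueue interface; the feature gate only
// decides whether the returned message turns into a pods/status PATCH.
func maybePatchPodStatus(gateEnabled bool, message string, dispatch func(string)) {
	if !gateEnabled || message == "" {
		return // message dropped; no API call enqueued
	}
	dispatch(message)
}

func main() {
	patch := func(m string) { fmt.Println("PATCH pods/status:", m) }
	maybePatchPodStatus(false, "blocked by plugin X", patch) // gate off: nothing happens
	maybePatchPodStatus(true, "blocked by plugin X", patch)  // gate on: PATCH enqueued
}
```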


During the beta period, the feature gate `SchedulerPreEnqueuePodStatus` is enabled by default,
so users don't need to opt in. This is a purely in-memory feature for the kube-scheduler,
so no special actions are required outside the scheduler.
Contributor

Except that out-of-tree plugin code needs to be updated to use the new API.
Unless we don't consider them "outside of scheduler"?

Member Author

Here, in the upgrade strategy, we assume that the scheduler is at the new version, so its plugins use the updated PreEnqueue interface. Enabling/disabling the feature gate doesn't impact the plugins, as they are already migrated.

Member

@dom4ha dom4ha left a comment

Very well written and a very valuable feature. I wonder how Workload Aware Scheduling may change the approach, if we could write a workload status instead for repeating pods. Maybe we could even write the status to hypothetical PodGroups?

Currently, when a `PreEnqueue` plugin rejects a Pod, this decision is not communicated back to the user
via the Pod's status. The Pod simply remains `Pending`, leaving the user to manually inspect scheduler logs
or other components to diagnose the issue. This lack of feedback is particularly problematic for plugins
that operate transparently to the user, as the reason for the delay is completely hidden.
Member

What does it mean exactly?

Member Author

@macsko macsko Sep 4, 2025

Do you have any specific sentence in mind? In general, this paragraph describes the current behavior, where we don't communicate the PreEnqueue rejection to the users.

Only for Scheduling Gates do we do that, through the kube-apiserver - it knows that the pod would be rejected because it can infer that from the pod spec.

Maybe I should rephrase the sentence about operating transparently. I meant that other plugins, like DRA, reject the Pod based on more indirect rules, which can't be easily inferred from the Pod.

Member

ah, right, I highlighted "plugins that operate transparently to the user", but GitHub does not reflect that.

maybe rephrase it?

Comment thread keps/sig-scheduling/5501-reflect-preenqueue-rejections-in-pod-status/README.md Outdated
(e.g., by returning an empty message described above).

- **Pros:**
- Reduces API server load and potential "noise" for plugins that reject pods for very short,
Member

Can't we mitigate this problem by introducing some sort of delay or rate limiting (for the number of PreEnqueue updates)? Can you document such a possibility with your assessment of whether it does or does not make sense?

Member Author

I'll take that into consideration and mention it in the KEP. We already have some rate limiting, as we limit the number of dispatched goroutines in the API dispatcher. I'm not sure how a delay could help here.

Some more advanced mechanisms could be a part of the Asynchronous API calls extension.

Member

A delay can help avoid reporting things that have a follow-up call clearing the message, so that only prolonged states would be reflected.

Member Author

Introducing a delay shifts responsibility from the plugin to the framework. However, I'm not sure if that's really what we want. Why shouldn't the plugin, which has the most information about why it rejected the Pod, decide what to do with the message?

Even if we have such a delay mechanism, how should we set the delay so that it is correct? If it is too low, we may send irrelevant messages (as in the case of DRA or async preemption) and increase the load on the kube-apiserver. If it is too high, we will delay the status update, as a result reducing the possibility of debugging and the purpose for which we want to have this feature.

In addition, we must keep our current plugins in mind:

  • SchedulingGates — its status is reported by kube-apiserver. We could move it to kube-scheduler, but this would involve unnecessary API calls that can be avoided by keeping the current behavior in kube-apiserver. Also, keep in mind that this plugin has its own status reason (SchedulingGated), which would have to be differentiated somehow if we move it to kube-scheduler.
  • DefaultPreemption — I believe we should rely on the status returned by PostFilter, so it should not be overwritten with the less informative PreEnqueue message.
  • DynamicResources — as mentioned, they may want to decide what and when they want to report.

Member

I thought "delay" here just meant making api calls "asynchronously".

and,

Introducing a delay shifts responsibility from the plugin to the framework. However, I'm not sure if that's really what we want.

We do. How many API calls we make should be handled in a single place, which is the framework side. We shouldn't rely on plugins to not make too many API calls.

Why shouldn't the plugin, which has the most information about why it rejected the Pod, decide what to do with the message?

Plugins decide what to report in the messages, and the framework should decide whether it can ship these messages or not. Of course, plugins should make their messages simple so that they can help the framework to reduce the total number of API calls.
But, eventually the framework should decide whether the scheduler can deliver the update or not, based on how many API calls it has in the queue.

we may send irrelevant messages (as in the case of DRA or async preemption)

We should encourage the messages from PreEnqueue to be simple. Then, even if a message's content is not very important, like "The preemption is on-going and the pod is rejected", the update goes out only once and not more (the framework should ignore API calls that won't change anything).


Overall, I still don't understand why we need to make the reporting opt-in. Repeating the same comment here as well, though: I believe all plugins should report something to users on all rejections. Even if a rejection is supposed to be resolved soon-ish (like the async preemption), that isn't a valid reason to hide why a pod is being blocked.

p.s., The scheduler doesn't show any message from PostFilter, does it?

Member Author

I was thinking for some time about the most feasible approach (the Alternatives section helped me). I think the approach with delays ("Implicit + Delayed (No opt-out)") might be the best and simplest solution. So:

  • All plugins would provide a message via fwk.Status. There won't be an option to opt out.
  • Before sending the message we use the LastPreEnqueueRejectionMessage to verify the message is new. If it's new, we send the call to update.
  • We send the delayed async API call to patch the status with new message. The number of seconds could be just hardcoded.

For SchedulingGates, which reports the message via the kube-apiserver, we have two options:

  • Remove this behavior from the apiserver and keep the kube-scheduler only. However, we would need to send a correct reason for that plugin.
  • Keep the apiserver reporting path, but unify the message with the SchedulingGates plugin. This way, the LastPreEnqueueRejectionMessage check would skip the call if the status was already populated by the apiserver.
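For illustration, a minimal Go sketch of this flow (PodInfo.LastPreEnqueueRejectionMessage and the hardcoded delay follow the discussion above, but this is a sketch, not the merged implementation):

```go
package main

import (
	"fmt"
	"time"
)

const statusPatchDelay = 5 * time.Second // hardcoded, per the discussion

// PodInfo is a simplified stand-in for the queue's pod bookkeeping.
type PodInfo struct {
	LastPreEnqueueRejectionMessage string
}

// onPreEnqueueRejection dedups against the cached message and enqueues a
// delayed async PATCH; admitting the pod (or a newer message) would cancel
// a still-pending call.
func onPreEnqueueRejection(pi *PodInfo, msg string, enqueueDelayed func(string, time.Duration)) {
	if msg == pi.LastPreEnqueueRejectionMessage {
		return // same rejection as last time: skip the redundant API call
	}
	pi.LastPreEnqueueRejectionMessage = msg
	enqueueDelayed(msg, statusPatchDelay)
}

func main() {
	pi := &PodInfo{}
	enqueue := func(m string, d time.Duration) { fmt.Printf("patch %q after %v\n", m, d) }
	onPreEnqueueRejection(pi, "waiting for gang quorum", enqueue)
	onPreEnqueueRejection(pi, "waiting for gang quorum", enqueue) // deduped, no output
}
```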

Member Author

I'll update KEP when I have time

- Allows plugins to opt out of reporting by returning an empty message.

- **Cons:**
- Requires a breaking change to the `PreEnqueue` plugin interface.
Member

Can we use some standard message for beta to avoid introducing a breaking change? Providing info that a pod waits and which plugin blocks it should already improve the current state. Also, we could prevent plugin developers from creating non-parsable messages by blocking variable messages. We could reevaluate it before moving to GA.

Member Author

@macsko macsko Sep 4, 2025

We shouldn't introduce such changes just before moving to GA, but rather do so during the beta phase. Keep in mind that v1.35 will include the final changes related to moving interfaces to staging, so plugin developers will have to change their plugin interfaces anyway.

- Gives plugin developers the flexibility to decide if a status update is valuable for their specific logic.

- **Cons:**
- Could lead to perceived inconsistency, where some `PreEnqueue` rejections appear on the Pod status
Member

Is this point about not reporting the fact that a pod waits, or just about skipping a custom message?

- Provides a clear, logical progression for users to follow.

- **Cons:**
- A stale message could be displayed if a Pod is first rejected by a reporting plugin (Plugin A)
Member

Why can't the message be updated once that happens?

Member Author

The reason is in the Actively clearing the message alternative below:

This would provide a more accurate real-time status but could cause confusion
if the condition appears and disappears while the Pod remains Pending for another, unreported reason.

Moreover, we need to consider the number of API calls we would like to send. I believe that if we can omit some of them, we should do so to mitigate the performance impact of this feature.

Member

If we had a delay that addresses the problem of noise, we could require setting a status message for all plugins (even a default one, "Blocked by plugin X"). Of course, clearing would not have any delay and would cancel setting a message that's no longer relevant. Also, a new message would instantly replace the previous one, but would be delayed as well, so we increase the likelihood that it's also canceled (no unnecessary noise).

I barely remember, but we had some argument against delaying API calls; maybe it was only for setting nominations, which are time-critical. In this case it should be rather safe.

Contributor

+1 for default messages ("blocked by plugin X") and for clearing messages / updating them with another plugin's

Comment thread keps/sig-scheduling/5501-reflect-preenqueue-rejections-in-pod-status/README.md Outdated
Member

@sanposhiho sanposhiho left a comment

I'm glad we can finally start this work, thanks to the async API call feature :)
I just skimmed through, and left some initial comments.

Comment on lines +175 to +177
- Use the raw status message: This is simpler as it requires no interface changes.
However, these messages are often not written well for end-users.
It also makes it difficult for a plugin to conditionally opt out of reporting a status.
Member

Hmm, actually this alternative looks better to me because it would match the messages from other extension points: we're already using the messages from the statuses returned from Filter etc.

It also makes it difficult for a plugin to conditionally opt out of reporting a status.

Do we need to allow plugins not to report the messages? That would just hide the reason the pod is stuck from users.

Member

And, even if you disagree with me and we do want to implement such opt-out behavior, I'd like to just extend the semantics of framework.Status: e.g., if a framework.Status has code == Unschedulable with reasons == nil, then we regard it as the plugin wanting to reject this pod but not report anything to users.
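For illustration, a sketch of those proposed semantics (simplified stand-in types, not the framework's actual ones; this alternative is argued against below):

```go
package main

import "fmt"

// Code and Status are simplified stand-ins for the framework's types.
type Code int

const Unschedulable Code = 2

type Status struct {
	Code    Code
	Reasons []string
}

// wantsSilentRejection interprets "Unschedulable with no reasons" as
// "reject the pod, but report nothing to users".
func wantsSilentRejection(s *Status) bool {
	return s != nil && s.Code == Unschedulable && len(s.Reasons) == 0
}

func main() {
	fmt.Println(wantsSilentRejection(&Status{Code: Unschedulable}))                         // true: reject quietly
	fmt.Println(wantsSilentRejection(&Status{Code: Unschedulable, Reasons: []string{"x"}})) // false: report "x"
}
```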

Member Author

we're using the messages from the statuses returned from Filter etc already

I agree that it would be more consistent with the actual plugin's implementation, but if we want to have an opt-out behavior, I'd consider having a new field.

Do we need to allow plugins not to report the messages? That would just hide the pod stucking reason from users.

From the In-Tree Plugin Changes section:

  • SchedulingGates: Will always return an empty message, as its status will continue to be set by the kube-apiserver.
  • DefaultPreemption: Will always return an empty message, as its state is transient
    and the Unschedulable status from the PostFilter stage is more descriptive.
  • DynamicResources: Will be updated to return a descriptive message when a ResourceClaim is missing.
    It can also return a nil result to avoid reporting transient delays,
    for example when waiting for a ResourceClaim to become available in the scheduler's cache.

Allowing not to report the message can reduce the traffic to kube-apiserver, when such operation is irrelevant/redundant.

if framework.Status has code == Unschedulable with reasons == nil, then we regard plugins want to reject this pod, but not report anything to users.

I am not in favor of this approach. If we do this, it will affect logging (we would like to record why the plugin rejected the pod, but might not want to display that to the user immediately) and make it harder for plugin developers to manage - the interface would be indirect, so they would have to rely on and remember that an empty reason has its own behavior. Introducing a dedicated field in PreEnqueueResult would make it more explicit.

Member

I have the impression that all the reasons not to report come down to saving kube-apiserver traffic, except for "SchedulingGates", which I may not understand well.

Delaying messages can reduce the traffic generated by transient issues. I'm only worried that a delayed message may overwrite some other message set in the meantime by someone else. But any message change initiated by the kube-scheduler should cancel the previously pending (waiting) one and replace it with a new one or clear it.

Member

I see you updated the KEP with good arguments against the delay. Some alternatives are to:

  • make the delay optional (but I'm not sure it's worth implementing)
  • update the workload object status instead of the blocked pods (once the generic workload object is defined; it's in the design phase now). The lack of a delay still generates noise for transient issues, although on just one object.
  • detect that objects have been waiting on PreEnqueue for a longer period of time. It's like double-checking whether the condition still holds before applying a status. Still sounds complicated.

Member

From the In-Tree Plugin Changes section:
...
Allowing not to report the message can reduce the traffic to kube-apiserver, when such operation is irrelevant/redundant.

Hmm, even for the default preemption / dynamic resources, I feel like it's better to report something to indicate the pods are not retried on purpose, since today that is not visible at all. As Dominik mentioned, the delay from the async API call mechanism can mitigate the concern about too many apiserver calls. And if we implement this idea that you mentioned in another thread of mine, I don't think there would be too many API calls triggered by those PreEnqueue plugins (assuming the plugins keep their messages as simple and stable as possible at every rejection).

Member Author

@macsko macsko Sep 10, 2025

Good point, I will add a larger section about the delay, and we will be able to make a decision on that.

(e.g., by returning an empty message described above).

- **Pros:**
- Reduces API server load and potential "noise" for plugins that reject pods for very short,
Member

To mitigate it, can we just skip the error reporting if it's identical to the existing one?
e.g., if the pod keeps being rejected by the preemption PreEnqueue plugin, it will get a message like This pod is not schedulable because waiting for the preemption to be completed for the first time, but the preemption PreEnqueue plugin will (i.e. should) return the same common message, and then the scheduling framework doesn't update the pod with the message every time PreEnqueue rejects the same pod with the same reason.

Member Author

That is already explained in the Preventing redundant API calls section of the Design details:

To prevent API server flooding when a pod is rejected for the same reason repeatedly, message caching will be introduced.

  • A new field will be added to the PodInfo struct within the queue, for example, lastPreEnqueueRejectionMessage.
  • Before dispatching a status patch, the scheduler will compare the new rejection message
    with the cached lastPreEnqueueRejectionMessage.
  • The asynchronous call to the API server will only be dispatched if the status is new or different from the cached one. (...)

- Introduces a new value to the API, which must be documented and maintained.

- **Alternatives Considered:**
- PodReasonUnschedulable (`Unschedulable`): Reusing this reason would be somewhat incorrect,
Member

@sanposhiho sanposhiho Sep 6, 2025

I prefer not to add a new one, and prefer to use the existing condition. I feel like adding a new condition is rather confusing for users, especially those who are not very familiar with the scheduler.
Also, keeping the same condition would help other external components (e.g., CA), since they won't need to support a new condition?

as the Pod has not failed a scheduling attempt (i.e., it was never evaluated against nodes).

Today, it can happen that the scheduling cycle doesn't evaluate nodes but rejects a pod at PreFilter.

Member Author

Also, keep using the same condition would help other external components (e.g., CA) since they won't need to support a new condition

Or break them, if they rely on the fact the pod was processed and rejected by the filters.

I think I was wrong about the CA use case: the PreEnqueue rejection shouldn't be processed by the CA, because such rejections are not related to the nodes (capacity etc.), but to the pod itself. That can't be fixed by the CA's node provisioning. So, using the Unschedulable status would force the CA to process such impossible-to-fix pods or try to filter them out somehow.

Today, it could happen that the scheduling cycle doesn't evaluate nodes, but reject a pod at PreFilter.

And such behavior could be eventually migrated to the PreEnqueue.

I feel like adding a new condition is rather confusing for users, especially those who are not very familiar with the scheduler.

The PreEnqueue rejection is different from the Unschedulable rejection as it shows that the pod is missing something, so I think having the new reason is okay.

Member

I think I was wrong with the CA use case and the PreEnqueue rejection shouldn't be processed by the CA

This is a good point, and I'm convinced.

@helayoty helayoty moved this from Needs Triage to Needs Review in SIG Scheduling Sep 9, 2025
@macsko
Member Author

macsko commented Sep 10, 2025

I've updated the KEP with the alternatives mentioned in the comments.

However, it might create significant API churn for plugins that reject pods frequently and for short durations,
negatively impacting performance and UX.
- Mandatory reporting with a delay (cooldown period): This approach attempts to reduce API churn by waiting for a brief,
configurable delay before reporting a rejection. While better than immediate mandatory reporting, this has some flaws:
Contributor

who should configure the delay?

Member Author

That's a good question. We could have a few options:

  • Hardcoded delay per API call, e.g., delay only the PreEnqueue status patch call by a certain time
  • Configurable delay through scheduler's config - might be hard to implement clearly
  • Per-plugin delay, passed via PreEnqueueResult - this would move the responsibility to the plugin developers, who might not know what the correct delay is anyway

the scheduler could generate a generic message (e.g., "Pod is blocked on PreEnqueue by plugin: Plugin B").
This has two flaws:
- While it identifies the blocking plugin, it doesn't explain why the Pod was blocked, which is the essential information for debugging.
A generic message isn't a significant improvement over a stale (but potentially actionable) message.
Contributor

In my opinion stale != actionable 🙂 It was actionable, but now is no longer because it's stale.

Member Author

A stale message is still actionable (someone can try to fix the issue or just wait for it to resolve), compared to a useless (only slightly informative) message that is not actionable at all.

return
}
// Enqueue a PodStatusPatch call to clear the NotReadyForScheduling condition.
// Enqueue a *delayed* PodStatusPatch call to clear the NotReadyForScheduling condition.
Contributor

Actually... do we consider clearing the message without delay? Sending the "update pod" immediately, without waiting the 5 seconds?
Or do we still want to wait 5 sec before issuing it, in case something changes for this pod?

Member Author

Clearing the message is not that relevant, and I don't think we need to send it without a delay. When the pod passes PreEnqueue, it's likely it will soon be tried and either scheduled (bound) or stay unschedulable (with another status update).

with a sub-variant for the Implicit Interface model (with or without an opt-out).

The five models are:
1. **Explicit + Immediate (KEP Proposal):** A new `PreEnqueueResult` is introduced.
Contributor

The KEP mentions a 5-second delay; I think that makes "Explicit + Delayed" the KEP proposal model?

Contributor

Other than that - yes, we are on the same page. Thank you for putting it all together in a clear and comprehensible form!

Member Author

The KEP proposal is now the Implicit + Delayed (No opt-out) model. I'll update the Alternatives, because that section is outdated now.

Comment thread keps/prod-readiness/sig-scheduling/5501.yaml Outdated
@wojtek-t wojtek-t self-assigned this Oct 3, 2025
Member

@sanposhiho sanposhiho left a comment

Looks awesome. Very close to /approve for me.


2. The impact on scheduling latency and throughput under heavy load will be measured using performance tests.

3. The delaying mechanism is expected to reduce the load on kube-apiserver by canceling
Member

@sanposhiho sanposhiho Oct 6, 2025

Also, along with this, we will recommend users/maintainers make error messages as consistent as possible to reduce the number of necessary updates. e.g., instead of saying this pod is blocked because it's waiting for the preemption for pod1, pod2... to be completed, say this pod is blocked because it's waiting for the preemption for some victim pods to be completed. (The former has to be updated every time each pod deletion is done while the latter is consistent)

Member Author

Good point, updated

@ania-borowiec
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2025
Comment on lines +166 to +167
Since the `SchedulerAsyncAPICalls` feature was disabled by default in v1.34,
successfully enabling the `SchedulerPreEnqueuePodStatus` feature in v1.35 will depend on re-enabling the `SchedulerAsyncAPICalls` feature.
Member

@sanposhiho sanposhiho Oct 11, 2025

Right, so does it mean we start SchedulerPreEnqueuePodStatus as beta/enabled by default, but it's effectively disabled by default because SchedulerAsyncAPICalls is disabled by default? Or should we also disable SchedulerPreEnqueuePodStatus by default?

More generally, what if SchedulerPreEnqueuePodStatus is enabled while SchedulerAsyncAPICalls is disabled? Will we just implicitly disable SchedulerPreEnqueuePodStatus in that case? Or should we crash the scheduler at startup to surface the misconfiguration to users?

Member Author

@macsko macsko Oct 13, 2025

Feature gate dependencies can now be codified (see kubernetes/kubernetes#133697). Using that, enabling only SchedulerPreEnqueuePodStatus won't be possible.

If we are unable to re-enable async API calls in v1.35, the reflect PreEnqueue feature will remain disabled by default (in alpha or beta).
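For illustration, a hand-rolled sketch of the kind of startup validation discussed above (hypothetical; the real mechanism is the codified feature-gate dependencies from kubernetes/kubernetes#133697):

```go
package main

import "fmt"

// validateFeatureGates fails fast on the dependency between the two gates,
// so the misconfiguration is surfaced at startup instead of being silently
// ignored.
func validateFeatureGates(gates map[string]bool) error {
	if gates["SchedulerPreEnqueuePodStatus"] && !gates["SchedulerAsyncAPICalls"] {
		return fmt.Errorf("SchedulerPreEnqueuePodStatus requires SchedulerAsyncAPICalls to be enabled")
	}
	return nil
}

func main() {
	err := validateFeatureGates(map[string]bool{
		"SchedulerPreEnqueuePodStatus": true,
		"SchedulerAsyncAPICalls":       false,
	})
	fmt.Println(err) // misconfiguration reported at startup
}
```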

Member

If we are unable to re-enable async API calls in v1.35, the reflect PreEnqueue feature will remain disabled by default (in alpha or beta).

Ok - can you update the explanation a bit to clarify this in the beta requirement section? like

- Implement the feature behind a feature gate (`SchedulerPreEnqueuePodStatus`), enabled by default.
+ Implement the feature behind a feature gate (`SchedulerPreEnqueuePodStatus`). We might start this gate disabled by default, depending on whether we reenable the SchedulerAsyncAPICalls gate or not.

Member Author

Added

Member

@wojtek-t wojtek-t left a comment

A few comments from the PRR perspective - once addressed, this looks reasonable to me.


As a data scientist running a distributed training job, I submit a batch of Pods that must be scheduled together (a "gang").
The custom gang scheduling logic, using a `PreEnqueue` plugin, blocks all these Pods from entering
the queue until there are enough resources for all of them to pass the scheduling.
Member

In the first implementation, those won't be blocked in PreEnqueue if resources are missing. The goal is to block them in PreEnqueue only until we see enough pods that the gang can, in theory, be scheduled. Determining whether there are enough resources for them will happen after PreEnqueue.

@dom4ha

Member

Correct, PreEnqueue waits until the workload object referred to by the pods appears and the pods reach quorum.

Member Author

Right, this part was outdated (written before the gang scheduling KEP proposal). I updated it now

The lack of a status condition is no worse than the current behavior.
Furthermore, any subsequent event that causes the Pod to be re-evaluated by the `PreEnqueue` plugins
will trigger a retry of the status patch. Exploring a more robust retry mechanism
within the asynchronous API calls feature itself would be a beneficial future enhancement.
Member

... but we don't consider it a blocker for this KEP.

Member Author

Right, added that

is left intact to avoid any performance regression in that scenario.

5. Future work: As the scheduler evolves, introducing batched status updates
could further mitigate the impact of many simultaneous rejections.
Member

Batching helps if we update the same pod multiple times with different conditions. I'm not sure this is really the primary concern.

Updating a large number of pods (e.g. large gangs) is imho the bigger concern, and batching will not help with that (we will not support cross-object batching at the API level). But OTOH we are marking those pods as pending anyway in that case - so if that happens once, it probably won't change anything...

Member Author

I meant some type of batching per group of pods. I changed that point to mention that we could have some per-workload status updates in the future instead.

3. If the checks pass, the scheduler proceeds:
- It immediately updates its internal state by setting `pInfo.LastPreEnqueueRejectionMessage` to the new message.
- It constructs the condition to patch and enqueues it with a delay into the asynchronous API dispatcher.
- An `Event` is emitted for the Pod with the reason `NotReadyForScheduling` and the rejection message.
Member

IIUC this is a regular k8s Event object, right? Do we batch those too somehow?
In other words, the exact same risks that we have for pods are true for Events too, and I don't see them being addressed.

Member Author

That's a good point. Indeed, sending an event requires making an API call, even if through a more optimized path. Anyway, I changed this part to emphasize that the event will be sent together with the API call to patch the Pod, so the risks should be mitigated.
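For illustration, a sketch of the condition built in step 3 above, assuming the rejection lands as a new reason on the PodScheduled condition (per the alternatives discussion; names follow the thread, not merged code):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildPreEnqueueCondition constructs the condition that would be enqueued
// as a delayed pods/status PATCH when a PreEnqueue plugin rejects the pod.
func buildPreEnqueueCondition(msg string) v1.PodCondition {
	return v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionFalse,
		Reason:             "NotReadyForScheduling",
		Message:            msg,
		LastTransitionTime: metav1.Now(),
	}
}

func main() {
	cond := buildPreEnqueueCondition("waiting for ResourceClaim to be created")
	fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
	// The Event (reason NotReadyForScheduling) is emitted together with
	// the patch call, per the comment above.
}
```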


Yes.

- API call type: `PATCH` on `pods/status`.
Member

What about new Events?

Member Author

Right, added

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 13, 2025
@macsko macsko force-pushed the kep_5501_reflect_preenqueue_rejections_in_pod_status branch from a703582 to 6eea5f6 on October 13, 2025 09:46
@wojtek-t
Member

This looks good for PRR.

/approve PRR

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 13, 2025
@sanposhiho
Member

/approve
/hold in case @dom4ha wants to take a look. Feel free to unhold if not.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: macsko, sanposhiho, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dom4ha
Member

dom4ha commented Oct 14, 2025

Looks good, thanks @macsko
/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 14, 2025
@dom4ha
Copy link
Copy Markdown
Member

dom4ha commented Oct 15, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 15, 2025
@k8s-ci-robot k8s-ci-robot merged commit 7fb57da into kubernetes:master Oct 15, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.35 milestone Oct 15, 2025
@github-project-automation github-project-automation Bot moved this from Needs Review to Done in SIG Scheduling Oct 15, 2025