Skip to content

fix(readiness): re-evaluate pod Ready condition in removeNotReadyKey#246

Merged
Syspretor merged 2 commits intosgl-project:mainfrom
sebest:fix/remove-not-ready-key-update-pod-ready-pr
Apr 10, 2026
Merged

fix(readiness): re-evaluate pod Ready condition in removeNotReadyKey#246
Syspretor merged 2 commits intosgl-project:mainfrom
sebest:fix/remove-not-ready-key-update-pod-ready-pr

Conversation

@sebest
Copy link
Copy Markdown
Contributor

@sebest sebest commented Mar 26, 2026

Summary

  • Fix asymmetry in removeNotReadyKey: After clearing all not-ready keys from a readiness gate condition, removeNotReadyKey now calls UpdatePodReadyCondition to re-evaluate the pod's Ready condition — matching what addNotReadyKey already does
  • Add LastTransitionTime to the Ready=False path in UpdatePodReadyCondition for consistency with the True path and Kubernetes API conventions
  • Add unit tests reproducing the bug and validating the fix

Root Cause

When addNotReadyKey set a readiness gate (e.g. InstancePodReady) to False, it called UpdatePodReadyCondition to set Ready=False. But when removeNotReadyKey later cleared all keys and set the gate back to True, it did not call UpdatePodReadyCondition, leaving Ready=False permanently.

In real clusters, the kubelet masks this by continuously re-evaluating readiness. In KWOK environments (no real kubelet), pods stay Ready=False forever, causing RBGSAs to report readyReplicas=0 and RBGs to show Ready=False.

Test plan

  • TestRemoveNotReadyKey_UpdatesPodReadyCondition — reproduces the exact bug (fails without the fix, passes with it)
  • TestAddRemoveNotReadyKey_MultipleKeys — verifies Ready only becomes True after all not-ready keys are removed
  • All existing tests pass: go test ./pkg/inplace/pod/...
  • Verified fix in KWOK kind cluster: all RBGs show Ready=True, RBGSAs show readyReplicas matching replicas

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a bug where Kubernetes Pods could remain in a Ready=False state indefinitely in environments lacking a kubelet (such as KWOK), even after their associated readiness gates were cleared. The core issue was an inconsistent handling of the PodReady condition between addNotReadyKey and removeNotReadyKey functions. The changes ensure that the PodReady condition is always accurately re-evaluated and updated when readiness gates are modified, thereby restoring proper pod readiness status and maintaining consistency with Kubernetes API standards.

Highlights

  • Asymmetry Fixed in Readiness Logic: The removeNotReadyKey function now correctly calls UpdatePodReadyCondition to re-evaluate the pod's Ready condition after clearing readiness gate keys, aligning its behavior with addNotReadyKey.
  • Consistent LastTransitionTime: LastTransitionTime is now explicitly set for Ready=False conditions within UpdatePodReadyCondition, ensuring consistency with the True path and Kubernetes API conventions.
  • New Unit Tests Added: Comprehensive unit tests have been introduced to reproduce the identified bug and validate the fix, covering scenarios with single and multiple readiness keys.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a bug where the PodReady condition was not being correctly re-evaluated after readiness gates were updated, potentially leaving pods stuck in a Ready=False state. The changes include adding LastTransitionTime to the PodReady condition when it's set to false and explicitly calling UpdatePodReadyCondition after modifying readiness gates. Comprehensive unit tests have been added to validate this fix, including scenarios with multiple not-ready keys. A review comment suggests an improvement for readability and to avoid re-initializing the v1.PodCondition struct by directly modifying its fields.

Comment thread pkg/inplace/pod/pod_util.go
@sebest
Copy link
Copy Markdown
Contributor Author

sebest commented Mar 26, 2026

I don't think the linter error is caused by my PR?!

@Syspretor
Copy link
Copy Markdown
Collaborator

I don't think the linter error is caused by my PR?!

@sebest The current lint failure is related to a code style check. In this case, it may be acceptable to ignore it, since the lint rules can sometimes be overly strict.

We are currently reviewing the changes, and if everything looks good, we can bypass this lint failure and proceed with the merge.

@sebest
Copy link
Copy Markdown
Contributor Author

sebest commented Mar 28, 2026

If this PR is accepted, I think it opens the door to do E2E tests with Kwok nodes .
Happy to submit a follow-up PR if this one get through.

sebest added 2 commits March 28, 2026 21:29
removeNotReadyKey did not call UpdatePodReadyCondition after setting a
readiness gate back to True, unlike addNotReadyKey which already did.
This caused pods to remain Ready=False permanently in environments
without a real kubelet (e.g. KWOK), since nothing else re-evaluated the
pod's Ready condition after the readiness gate transition.

Also add LastTransitionTime to the False path in UpdatePodReadyCondition
for consistency with the True path and Kubernetes API conventions.
Add tests that reproduce the bug where removeNotReadyKey left the pod
Ready condition stale after clearing all readiness gate not-ready keys.

- TestRemoveNotReadyKey_UpdatesPodReadyCondition: verifies the full
  add/remove cycle restores Ready=True
- TestAddRemoveNotReadyKey_MultipleKeys: verifies Ready only becomes
  True after all not-ready keys are removed
@sebest sebest force-pushed the fix/remove-not-ready-key-update-pod-ready-pr branch from fc810ec to 5cf90e6 Compare March 29, 2026 04:31
@sebest
Copy link
Copy Markdown
Contributor Author

sebest commented Apr 1, 2026

@Syspretor any remaining blocker?

Copy link
Copy Markdown
Collaborator

@Syspretor Syspretor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@Syspretor Syspretor merged commit f5e2811 into sgl-project:main Apr 10, 2026
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants