fix(readiness): re-evaluate pod Ready condition in removeNotReadyKey#246
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a bug where Kubernetes Pods could remain in a Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request addresses a bug where the PodReady condition was not being correctly re-evaluated after readiness gates were updated, potentially leaving pods stuck in a Ready=False state. The changes include adding LastTransitionTime to the PodReady condition when it's set to false and explicitly calling UpdatePodReadyCondition after modifying readiness gates. Comprehensive unit tests have been added to validate this fix, including scenarios with multiple not-ready keys. A review comment suggests an improvement for readability and to avoid re-initializing the v1.PodCondition struct by directly modifying its fields.
|
I don't think the linter error is caused by my PR?! |
@sebest The current lint failure is related to a code style check. In this case, it may be acceptable to ignore it, since the lint rules can sometimes be overly strict. We are currently reviewing the changes, and if everything looks good, we can bypass this lint failure and proceed with the merge. |
|
If this PR is accepted, I think it opens the door to do E2E tests with Kwok nodes . |
removeNotReadyKey did not call UpdatePodReadyCondition after setting a readiness gate back to True, unlike addNotReadyKey which already did. This caused pods to remain Ready=False permanently in environments without a real kubelet (e.g. KWOK), since nothing else re-evaluated the pod's Ready condition after the readiness gate transition. Also add LastTransitionTime to the False path in UpdatePodReadyCondition for consistency with the True path and Kubernetes API conventions.
Add tests that reproduce the bug where removeNotReadyKey left the pod Ready condition stale after clearing all readiness gate not-ready keys. - TestRemoveNotReadyKey_UpdatesPodReadyCondition: verifies the full add/remove cycle restores Ready=True - TestAddRemoveNotReadyKey_MultipleKeys: verifies Ready only becomes True after all not-ready keys are removed
fc810ec to
5cf90e6
Compare
|
@Syspretor any remaining blocker? |
Summary
removeNotReadyKey: After clearing all not-ready keys from a readiness gate condition,removeNotReadyKeynow callsUpdatePodReadyConditionto re-evaluate the pod'sReadycondition — matching whataddNotReadyKeyalready doesLastTransitionTimeto theReady=Falsepath inUpdatePodReadyConditionfor consistency with theTruepath and Kubernetes API conventionsRoot Cause
When
addNotReadyKeyset a readiness gate (e.g.InstancePodReady) toFalse, it calledUpdatePodReadyConditionto setReady=False. But whenremoveNotReadyKeylater cleared all keys and set the gate back toTrue, it did not callUpdatePodReadyCondition, leavingReady=Falsepermanently.In real clusters, the kubelet masks this by continuously re-evaluating readiness. In KWOK environments (no real kubelet), pods stay
Ready=Falseforever, causing RBGSAs to reportreadyReplicas=0and RBGs to showReady=False.Test plan
TestRemoveNotReadyKey_UpdatesPodReadyCondition— reproduces the exact bug (fails without the fix, passes with it)TestAddRemoveNotReadyKey_MultipleKeys— verifiesReadyonly becomesTrueafter all not-ready keys are removedgo test ./pkg/inplace/pod/...Ready=True, RBGSAs showreadyReplicasmatchingreplicas