fix(test): argocd-sync gate: use all-semantics, handle health-gated apps#1397
Conversation
|
Welcome to AICR, @rsd-darshan! Thanks for your first pull request. Before review, please ensure:
A maintainer will review this soon. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThe KWOK Argo CD sync Chainsaw test and its supporting documentation and scripts receive coordinated updates to increase assertion timeouts and fix state validation logic. The assertion timeout is raised from 5 minutes to 8 minutes in both the argocd-sync and argocd-git-sync tests, with the corresponding KWOK_ARGOCD_SYNC_TIMEOUT default increased from 300 to 480 seconds in the validation script and documentation. The validation script is updated to pass the same timeout as both Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml`:
- Around line 132-175: The header comment for the assert-all-applications-pass
step describes the old predicate that includes operationState=Succeeded for arms
3 and 4, but the actual implementation at line 175 and the description at lines
155-159 have been updated to only require sync and health states. Update the
comment bullets describing arm 3 (OutOfSync + Healthy) and arm 4 (Synced +
Degraded) to remove the operationState=Succeeded references so they match the
current implementation.
- Around line 67-70: Update the inline comment at line 69 to reference the
correct environment variable name `KWOK_ARGOCD_SYNC_TIMEOUT` instead of
`KWOK_FLUX_SYNC_TIMEOUT`, as the test is for Argo CD not Flux. Additionally,
verify that the timeout values are coordinated between the YAML assert timeout
(currently 8m) and the `--assert-timeout` flag passed from the validation script
which defaults to 300 seconds (5 minutes). Either set `KWOK_ARGOCD_SYNC_TIMEOUT`
environment variable to 480 seconds to match the 8-minute intent, or update the
YAML timeout value to match the script's default of 5 minutes, and document
which timeout is actually being used in the test execution.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c5694d79-86b4-46b6-9b2a-16cb4313e4aa
📒 Files selected for processing (1)
tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1890-1891: The usage() function's help text for the --deployer
option is missing the argocd-git deployer from the list of supported values,
causing the documentation to be inconsistent with the actual accepted values
validated in the script. Update the deployer description in the usage() function
to include argocd-git in the comma-separated list of supported deployers (helm,
argocd-oci, argocd-helm-oci, argocd-git, flux-oci, flux-git) so the help text
accurately reflects all available options.
- Around line 448-450: The required dependencies array at line 450 only adds git
when DEPLOYER is flux-git, but argocd-git also executes git commands in the
push_bundle_to_gitea function, which will fail with a unclear command-not-found
error if git is missing. Add an additional condition to also append git to the
required array when DEPLOYER equals argocd-git, ensuring both flux-git and
argocd-git deployments have git as a preflight requirement check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b916f20b-c35d-4509-973f-477d263b5ee3
📒 Files selected for processing (2)
docs/contributor/tests.mdkwok/scripts/validate-scheduling.sh
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1890-1891: The usage() function's help text for the --deployer
option is missing the argocd-git deployer from the list of supported values,
causing the documentation to be inconsistent with the actual accepted values
validated in the script. Update the deployer description in the usage() function
to include argocd-git in the comma-separated list of supported deployers (helm,
argocd-oci, argocd-helm-oci, argocd-git, flux-oci, flux-git) so the help text
accurately reflects all available options.
- Around line 448-450: The required dependencies array at line 450 only adds git
when DEPLOYER is flux-git, but argocd-git also executes git commands in the
push_bundle_to_gitea function, which will fail with a unclear command-not-found
error if git is missing. Add an additional condition to also append git to the
required array when DEPLOYER equals argocd-git, ensuring both flux-git and
argocd-git deployments have git as a preflight requirement check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b916f20b-c35d-4509-973f-477d263b5ee3
📒 Files selected for processing (2)
docs/contributor/tests.mdkwok/scripts/validate-scheduling.sh
🛑 Comments failed to post (2)
kwok/scripts/validate-scheduling.sh (2)
448-450:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd
gitto dependency checks forargocd-gittoo.Line 450 only gates
gitforflux-git, butargocd-gitalso executesgitinpush_bundle_to_gitea()(Line 690+). This can fail late withcommand not foundinstead of a clear preflight error.Suggested patch
- [[ "$DEPLOYER" == "flux-git" ]] && required+=(git) + [[ "$DEPLOYER" == "flux-git" || "$DEPLOYER" == "argocd-git" ]] && required+=(git)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kwok/scripts/validate-scheduling.sh` around lines 448 - 450, The required dependencies array at line 450 only adds git when DEPLOYER is flux-git, but argocd-git also executes git commands in the push_bundle_to_gitea function, which will fail with a unclear command-not-found error if git is missing. Add an additional condition to also append git to the required array when DEPLOYER equals argocd-git, ensuring both flux-git and argocd-git deployments have git as a preflight requirement check.
1890-1891:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
usage()omitsargocd-gitfrom supported deployers.The help text is now out of sync with the accepted values in Line 1943, which can mislead users invoking the new lane.
Suggested patch
- --deployer <name> One of: helm (default), argocd-oci, argocd-helm-oci, - flux-oci, flux-git. + --deployer <name> One of: helm (default), argocd-oci, argocd-helm-oci, + argocd-git, flux-oci, flux-git. @@ validate-scheduling.sh --deployer argocd-oci eks-training validate-scheduling.sh --deployer argocd-helm-oci eks-training + validate-scheduling.sh --deployer argocd-git eks-training validate-scheduling.sh --deployer flux-oci eks-training validate-scheduling.sh --deployer flux-git eks-trainingAlso applies to: 1893-1899
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kwok/scripts/validate-scheduling.sh` around lines 1890 - 1891, The usage() function's help text for the --deployer option is missing the argocd-git deployer from the list of supported values, causing the documentation to be inconsistent with the actual accepted values validated in the script. Update the deployer description in the usage() function to include argocd-git in the comma-separated list of supported deployers (helm, argocd-oci, argocd-helm-oci, argocd-git, flux-oci, flux-git) so the help text accurately reflects all available options.
mchmarny
left a comment
There was a problem hiding this comment.
The design is sound and correctly mirrors the proven flux-sync/flux-git-sync error: inverted-polarity pattern — the staged assert-root-app-succeeded gate guarding the zero-apps vacuous-pass is the right structure, and catching that the bash KWOK_ARGOCD_SYNC_TIMEOUT default made the YAML bump a no-op is a sharp fix.
One blocker, though: the same class of no-op was reintroduced on the YAML side. The converted assert-all-applications-pass step uses error: polarity, which chainsaw bounds by timeouts.error (default 10s in tests/chainsaw/chainsaw-config.yaml), not timeouts.assert. Both files need error: 8m added — the flux tests set it explicitly for exactly this reason. Without it the new all-semantics gate gets ~10s and fails spuriously on real clusters. Inline comments on both timeouts blocks, plus one nit on a vestigial DeMorgan comment.
Also worth noting: the argocd-sync / argocd-git-sync KWOK lanes don't appear to have run on this SHA (only Label/Auto-Merge checks present) and the branch is behind main — rebasing and running those lanes would catch the timeout gap directly. Not blocking beyond the code fix above.
The fleet check in tests/chainsaw/kwok/argocd-sync used a name-less assert: block, which chainsaw passes as soon as ANY matched Application satisfies the predicate (exists-semantics). Two failure modes followed: a vacuous pass when the sweep evaluates against the root Application's empty status before children materialize, and a timeout on KWOK lanes where health-gated Applications (e.g. kube-prometheus-stack waiting on Pod readiness that KWOK never delivers) never reach operationState.phase==Succeeded even though they're genuinely synced. Mirrors the proven flux-sync/flux-git-sync pattern: a staged assert-root-app-succeeded step closes the apply-time race on the root Application only, and assert-all-applications-pass switches to an inverted-polarity error: block (fails when ANY Application does NOT match the 4-arm terminal-pass predicate), which is the actual all-semantics contract. The per-child predicate drops the operationState check so health-gated apps don't hang the gate. Applied identically to the argocd-git-sync sibling per its SYNC NOTE requirement to keep these two steps byte-identical. Coordinated the timeout budget across all three places it's encoded: - spec.timeouts.assert and timeouts.error both bumped to 8m (error: ops are bounded by timeouts.error, not timeouts.assert, which chainsaw defaults to 10s — missing this would have made the bump a no-op for the converted step) - KWOK_ARGOCD_SYNC_TIMEOUT default raised from 300 to 480 in validate-scheduling.sh, since --assert-timeout/--error-timeout override the YAML value at invocation - usage() help text updated to include argocd-git in the --deployer list Fixes NVIDIA#1388
e6688f6 to
56d42cf
Compare
|
@mchmarny good catch on the error timeout — added |
Summary
Fixes issue #1388. Replaces exist-semantics assert with inverted-polarity error: to enforce that ALL Applications reach terminal-pass state, not just any one.
Changes
KWOK_ARGOCD_SYNC_TIMEOUTdefault from 300 to 480 invalidate-scheduling.sh— this env var feeds--assert-timeout, which overridesspec.timeouts.assertat invocation, so the YAML's 8m bump was a no-op until the script's default was raised tooRoot Cause
The original code used a name-less
assert:block that passed when ANY Application matched the predicate (exist-semantics). Two failure modes emerged:Test Coverage
tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml,tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml,kwok/scripts/validate-scheduling.sh,docs/contributor/tests.mdReferences