Skip to content

fix(test): argocd-sync gate: use all-semantics, handle health-gated apps#1397

Merged
mchmarny merged 2 commits into
NVIDIA:mainfrom
rsd-darshan:fix/argocd-sync-gate
Jun 22, 2026
Merged

fix(test): argocd-sync gate: use all-semantics, handle health-gated apps#1397
mchmarny merged 2 commits into
NVIDIA:mainfrom
rsd-darshan:fix/argocd-sync-gate

Conversation

@rsd-darshan

@rsd-darshan rsd-darshan commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes issue #1388. Replaces exist-semantics assert with inverted-polarity error: to enforce that ALL Applications reach terminal-pass state, not just any one.

Changes

  1. Added staged assert-root-app-succeeded step — gates fleet sweep against apply-time races where kubectl apply creates root with empty status and the sweep evaluates before children materialize
  2. Switched assert-all-applications-pass to inverted-polarity error: blocks when ANY Application fails predicate (all-semantics vs exist-semantics)
  3. Removed operationState.phase==Succeeded from per-child check — fixes hang on health-gated apps (e.g., kube-prometheus-stack) that stay Progressing on KWOK but are genuinely synced
  4. Increased timeout from 5m to 8m — gives the all-semantics gate (which waits for every Application, not just the first) more budget than the old exists-semantics check needed
  5. Bumped KWOK_ARGOCD_SYNC_TIMEOUT default from 300 to 480 in validate-scheduling.sh — this env var feeds --assert-timeout, which overrides spec.timeouts.assert at invocation, so the YAML's 8m bump was a no-op until the script's default was raised too

Root Cause

The original code used a name-less assert: block that passed when ANY Application matched the predicate (exist-semantics). Two failure modes emerged:

  • Apply-time race: Root app created with empty status; sweep evaluates before children exist → vacuous pass
  • Health-gate hang: Apps with op=Running waiting for Pod readiness never reach op=Succeeded → timeout

Test Coverage

  • Modified tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml, tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml, kwok/scripts/validate-scheduling.sh, docs/contributor/tests.md
  • No Go code changes
  • Syntax validated

References

@rsd-darshan rsd-darshan requested a review from a team as a code owner June 21, 2026 08:49
@copy-pr-bot

copy-pr-bot Bot commented Jun 21, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

Copy link
Copy Markdown
Contributor

Welcome to AICR, @rsd-darshan! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@coderabbitai

coderabbitai Bot commented Jun 21, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: a429b8eb-dd50-43d9-9bad-765b2c61c36e

📥 Commits

Reviewing files that changed from the base of the PR and between e6688f6 and 56d42cf.

📒 Files selected for processing (4)
  • docs/contributor/tests.md
  • kwok/scripts/validate-scheduling.sh
  • tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml
  • tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml

📝 Walkthrough

Walkthrough

The KWOK Argo CD sync Chainsaw test and its supporting documentation and scripts receive coordinated updates to increase assertion timeouts and fix state validation logic. The assertion timeout is raised from 5 minutes to 8 minutes in both the argocd-sync and argocd-git-sync tests, with the corresponding KWOK_ARGOCD_SYNC_TIMEOUT default increased from 300 to 480 seconds in the validation script and documentation. The validation script is updated to pass the same timeout as both --assert-timeout and --error-timeout to chainsaw, overriding chainsaw's default error timeout. A new step, assert-root-app-succeeded, is inserted in both tests before the fleet check; it blocks progression until the root Argo CD Application reports status.operationState.phase == Succeeded, with diagnostic dumps on failure. The existing assert-all-applications-pass step is rewritten in both tests to use an error: block with an inverted predicate (DeMorgan negation), so the test fails when any Application in the argocd namespace falls outside the four permitted sync/health terminal states; the per-application operationState.phase == Succeeded requirement is removed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/aicr#1390: Updates the shared kwok/scripts/validate-scheduling.sh Argo CD sync gate wiring that is directly used alongside that PR's argocd-git-sync lane test changes and gate logic.

Suggested labels

area/ci, area/docs

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly describes the main change: fixing the argocd-sync gate to use all-semantics and handle health-gated apps, which aligns with the primary objective of enforcing that ALL Applications reach terminal-pass state.
Description check ✅ Passed The description comprehensively explains the changes, root causes, and rationale for the fix, including detailed information about the staged assert step, inverted-polarity error gate, timeout adjustments, and the environment variable default bump.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml`:
- Around line 132-175: The header comment for the assert-all-applications-pass
step describes the old predicate that includes operationState=Succeeded for arms
3 and 4, but the actual implementation at line 175 and the description at lines
155-159 have been updated to only require sync and health states. Update the
comment bullets describing arm 3 (OutOfSync + Healthy) and arm 4 (Synced +
Degraded) to remove the operationState=Succeeded references so they match the
current implementation.
- Around line 67-70: Update the inline comment at line 69 to reference the
correct environment variable name `KWOK_ARGOCD_SYNC_TIMEOUT` instead of
`KWOK_FLUX_SYNC_TIMEOUT`, as the test is for Argo CD not Flux. Additionally,
verify that the timeout values are coordinated between the YAML assert timeout
(currently 8m) and the `--assert-timeout` flag passed from the validation script
which defaults to 300 seconds (5 minutes). Either set `KWOK_ARGOCD_SYNC_TIMEOUT`
environment variable to 480 seconds to match the 8-minute intent, or update the
YAML timeout value to match the script's default of 5 minutes, and document
which timeout is actually being used in the test execution.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c5694d79-86b4-46b6-9b2a-16cb4313e4aa

📥 Commits

Reviewing files that changed from the base of the PR and between eea2a81 and 1aa3bfe.

📒 Files selected for processing (1)
  • tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml

Comment thread tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml Outdated
Comment thread tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1890-1891: The usage() function's help text for the --deployer
option is missing the argocd-git deployer from the list of supported values,
causing the documentation to be inconsistent with the actual accepted values
validated in the script. Update the deployer description in the usage() function
to include argocd-git in the comma-separated list of supported deployers (helm,
argocd-oci, argocd-helm-oci, argocd-git, flux-oci, flux-git) so the help text
accurately reflects all available options.
- Around line 448-450: The required dependencies array at line 450 only adds git
when DEPLOYER is flux-git, but argocd-git also executes git commands in the
push_bundle_to_gitea function, which will fail with a unclear command-not-found
error if git is missing. Add an additional condition to also append git to the
required array when DEPLOYER equals argocd-git, ensuring both flux-git and
argocd-git deployments have git as a preflight requirement check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b916f20b-c35d-4509-973f-477d263b5ee3

📥 Commits

Reviewing files that changed from the base of the PR and between 1aa3bfe and 87b7b41.

📒 Files selected for processing (2)
  • docs/contributor/tests.md
  • kwok/scripts/validate-scheduling.sh

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1890-1891: The usage() function's help text for the --deployer
option is missing the argocd-git deployer from the list of supported values,
causing the documentation to be inconsistent with the actual accepted values
validated in the script. Update the deployer description in the usage() function
to include argocd-git in the comma-separated list of supported deployers (helm,
argocd-oci, argocd-helm-oci, argocd-git, flux-oci, flux-git) so the help text
accurately reflects all available options.
- Around line 448-450: The required dependencies array at line 450 only adds git
when DEPLOYER is flux-git, but argocd-git also executes git commands in the
push_bundle_to_gitea function, which will fail with a unclear command-not-found
error if git is missing. Add an additional condition to also append git to the
required array when DEPLOYER equals argocd-git, ensuring both flux-git and
argocd-git deployments have git as a preflight requirement check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b916f20b-c35d-4509-973f-477d263b5ee3

📥 Commits

Reviewing files that changed from the base of the PR and between 1aa3bfe and 87b7b41.

📒 Files selected for processing (2)
  • docs/contributor/tests.md
  • kwok/scripts/validate-scheduling.sh
🛑 Comments failed to post (2)
kwok/scripts/validate-scheduling.sh (2)

448-450: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add git to dependency checks for argocd-git too.

Line 450 only gates git for flux-git, but argocd-git also executes git in push_bundle_to_gitea() (Line 690+). This can fail late with command not found instead of a clear preflight error.

Suggested patch
-    [[ "$DEPLOYER" == "flux-git" ]] && required+=(git)
+    [[ "$DEPLOYER" == "flux-git" || "$DEPLOYER" == "argocd-git" ]] && required+=(git)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kwok/scripts/validate-scheduling.sh` around lines 448 - 450, The required
dependencies array at line 450 only adds git when DEPLOYER is flux-git, but
argocd-git also executes git commands in the push_bundle_to_gitea function,
which will fail with a unclear command-not-found error if git is missing. Add an
additional condition to also append git to the required array when DEPLOYER
equals argocd-git, ensuring both flux-git and argocd-git deployments have git as
a preflight requirement check.

1890-1891: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

usage() omits argocd-git from supported deployers.

The help text is now out of sync with the accepted values in Line 1943, which can mislead users invoking the new lane.

Suggested patch
-  --deployer <name>       One of: helm (default), argocd-oci, argocd-helm-oci,
-                          flux-oci, flux-git.
+  --deployer <name>       One of: helm (default), argocd-oci, argocd-helm-oci,
+                          argocd-git, flux-oci, flux-git.
@@
   validate-scheduling.sh --deployer argocd-oci eks-training
   validate-scheduling.sh --deployer argocd-helm-oci eks-training
+  validate-scheduling.sh --deployer argocd-git eks-training
   validate-scheduling.sh --deployer flux-oci eks-training
   validate-scheduling.sh --deployer flux-git eks-training

Also applies to: 1893-1899

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kwok/scripts/validate-scheduling.sh` around lines 1890 - 1891, The usage()
function's help text for the --deployer option is missing the argocd-git
deployer from the list of supported values, causing the documentation to be
inconsistent with the actual accepted values validated in the script. Update the
deployer description in the usage() function to include argocd-git in the
comma-separated list of supported deployers (helm, argocd-oci, argocd-helm-oci,
argocd-git, flux-oci, flux-git) so the help text accurately reflects all
available options.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design is sound and correctly mirrors the proven flux-sync/flux-git-sync error: inverted-polarity pattern — the staged assert-root-app-succeeded gate guarding the zero-apps vacuous-pass is the right structure, and catching that the bash KWOK_ARGOCD_SYNC_TIMEOUT default made the YAML bump a no-op is a sharp fix.

One blocker, though: the same class of no-op was reintroduced on the YAML side. The converted assert-all-applications-pass step uses error: polarity, which chainsaw bounds by timeouts.error (default 10s in tests/chainsaw/chainsaw-config.yaml), not timeouts.assert. Both files need error: 8m added — the flux tests set it explicitly for exactly this reason. Without it the new all-semantics gate gets ~10s and fails spuriously on real clusters. Inline comments on both timeouts blocks, plus one nit on a vestigial DeMorgan comment.

Also worth noting: the argocd-sync / argocd-git-sync KWOK lanes don't appear to have run on this SHA (only Label/Auto-Merge checks present) and the branch is behind main — rebasing and running those lanes would catch the timeout gap directly. Not blocking beyond the code fix above.

Comment thread tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
Comment thread tests/chainsaw/kwok/argocd-git-sync/chainsaw-test.yaml
Comment thread tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml
The fleet check in tests/chainsaw/kwok/argocd-sync used a name-less
assert: block, which chainsaw passes as soon as ANY matched Application
satisfies the predicate (exists-semantics). Two failure modes followed:
a vacuous pass when the sweep evaluates against the root Application's
empty status before children materialize, and a timeout on KWOK lanes
where health-gated Applications (e.g. kube-prometheus-stack waiting on
Pod readiness that KWOK never delivers) never reach
operationState.phase==Succeeded even though they're genuinely synced.

Mirrors the proven flux-sync/flux-git-sync pattern: a staged
assert-root-app-succeeded step closes the apply-time race on the root
Application only, and assert-all-applications-pass switches to an
inverted-polarity error: block (fails when ANY Application does NOT
match the 4-arm terminal-pass predicate), which is the actual
all-semantics contract. The per-child predicate drops the
operationState check so health-gated apps don't hang the gate.

Applied identically to the argocd-git-sync sibling per its SYNC NOTE
requirement to keep these two steps byte-identical.

Coordinated the timeout budget across all three places it's encoded:
- spec.timeouts.assert and timeouts.error both bumped to 8m (error:
  ops are bounded by timeouts.error, not timeouts.assert, which chainsaw
  defaults to 10s — missing this would have made the bump a no-op for
  the converted step)
- KWOK_ARGOCD_SYNC_TIMEOUT default raised from 300 to 480 in
  validate-scheduling.sh, since --assert-timeout/--error-timeout
  override the YAML value at invocation
- usage() help text updated to include argocd-git in the --deployer list

Fixes NVIDIA#1388
@rsd-darshan rsd-darshan force-pushed the fix/argocd-sync-gate branch from e6688f6 to 56d42cf Compare June 22, 2026 16:09
@rsd-darshan

Copy link
Copy Markdown
Contributor Author

@mchmarny good catch on the error timeout — added error: 8m to both files and --error-timeout in the driver, squashed and rebased onto current main. Should be ready for another look.

@rsd-darshan rsd-darshan requested a review from mchmarny June 22, 2026 16:26
@github-actions

Copy link
Copy Markdown
Contributor

@mchmarny mchmarny enabled auto-merge (squash) June 22, 2026 19:09
@mchmarny mchmarny merged commit 7f7ae67 into NVIDIA:main Jun 22, 2026
149 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants