Skip to content

fix(recipe): fix deployment ordering to honor enabled:false#1465

Merged
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/recipe-topo-honor-enabled
Jun 25, 2026
Merged

fix(recipe): fix deployment ordering to honor enabled:false#1465
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/recipe-topo-honor-enabled

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Honor overrides: {enabled: false} end-to-end so a component disabled in a recipe/overlay (e.g. cert-manager on OKE, where the CSP provides its own) is excluded from both the recipe's deploymentOrder and the generated bundle — across all deployers, including helmfile. Re-enabling a recipe-disabled component at bundle time is explicitly disallowed.

Notes for Reviewers

  • Broader than it first looked. The recipe layer and the helmfile bundle layer use different topological-sort code paths, so fixing only the recipe layer left the helmfile deployer still failing with a false circular-dependency error on a disabled depended-upon component. The fix covers both: skip disabled components in the recipe-layer sort, and prune dangling dependency edges in the bundler before the helmfile deployer recomputes levels.
  • Re-enabling a recipe-disabled component is disallowed (--set <component>:enabled=true now errors). It would complicate the fix, and — more importantly — re-enabling at bundle time creates drift between the recipe and what is actually deployed. I'm not convinced we should allow it at all; open to discussion.

Motivation / Context

The recipe and bundle layers invoke different topological-sort helpers, and neither honored a disabled component:

  • TopologicalSort() (recipe resolution) and ComponentRefsTopologicalLevels() (helmfile bundle generation) iterated over every ComponentRef with no IsEnabled() check, so a disabled component still appeared in deploymentOrder.
  • The bundler removes disabled components from ComponentRefs but left dependents' DependencyRefs dangling, so the helmfile deployer's ComponentRefsTopologicalLevels call saw an undeclared edge and failed with a false circular-dependency error.

This blocked the clean way to drop the AICR-bundled cert-manager for OKE (OKE infra provides its own; two cert-managers on one cluster conflict), part of the P0 vanilla-OKE recipe work.

Fixes: #1464
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler)
  • Docs/examples (docs/)

Implementation Notes

Recipe layer (pkg/recipe/metadata.go). TopologicalSort() and ComponentRefsTopologicalLevels() now skip disabled components and treat an edge pointing at a declared-but-disabled component as already satisfied (assumed provided externally), so dependents such as gpu-operator order correctly instead of tripping the cycle guard. Edges to undeclared components are still counted, preserving missing-dependency detection (guarded by an existing test). Shared logic factored into componentSets() / edgeSatisfiedExternally().

Bundler (pkg/bundler/bundler.go). After filtering disabled components out of ComponentRefs, prune dependents' DependencyRefs that point at the removed components, so the helmfile deployer's level computation no longer sees a dangling edge. DeploymentOrder continues to be filtered from the recipe's authored order (not recomputed), preserving existing ordering.

Re-enable disallowed. --set <component>:enabled=true on a component the recipe disabled is now rejected with ErrCodeInvalidRequest. The author disables a component because the platform already provides it, so re-enabling would install a conflicting second copy and has no authored deployment order. --set <component>:enabled=false still disables a component the recipe leaves on.

Testing

go test ./pkg/recipe/... ./pkg/bundler/...
golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./pkg/bundler/...
  • pkg/recipe and pkg/bundler suites pass; lint reports 0 issues.
  • Recipe topo tests gained disabled-leaf and disabled-dependency cases; the bundler --set enabled precedence test now asserts re-enabling a recipe-disabled component errors.
  • Pre-existing, unrelated pkg/bundler/attestation failure (TestFetchAmbientOIDCToken) reproduces on clean origin/main — environmental (ambient OIDC), outside this change.

Manual end-to-end verification

Built aicr from this branch and temporarily disabled cert-manager (has dependents gpu-operator + nvsentinel) and k8s-ephemeral-storage-metrics (a leaf) in the h100-eks-ubuntu-training overlay via overrides: {enabled: false}. The overlay edit was for verification only and is not part of this PR.

  1. aicr recipedeploymentOrder 14 → 12, both disabled components gone; gpu-operator + nvsentinel still ordered correctly (no false cycle); both disabled components remain in componentRefs with enabled: false.
  2. aicr bundle (default helm deployer) — exactly 12 numbered component dirs; no disabled component anywhere.
  3. aicr bundle --deployer helmfile (the previously-broken path) — succeeds, 12 dirs, no dangling cert-manager edge / false circular-dependency error.
  4. aicr bundle --set certmanager:enabled=truerejected with [INVALID_REQUEST] component "cert-manager" is disabled by the recipe and cannot be re-enabled.

Risk Assessment

  • Low — Behavior is unchanged when no component is disabled (recipe deploymentOrder still filtered from the authored order; bundler pruning is a no-op without dangling edges). Additive tests; easy to revert.

Rollout notes: N/A — backwards compatible. The only behavior change for existing users is that --set <component>:enabled=true on a recipe-disabled component now errors instead of silently re-including it (previously unsupported in practice — the re-enabled component was mis-ordered).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner June 24, 2026 22:23
@yuanchen8911 yuanchen8911 marked this pull request as draft June 24, 2026 22:24
@yuanchen8911 yuanchen8911 changed the title fix(recipe): honor enabled:false in deployment ordering fix(recipe): honor enabled:false in deployment ordering fix Jun 24, 2026
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

pkg/recipe/metadata.go now filters topological ordering to enabled component refs and treats dependency edges to declared-but-disabled components as satisfied. pkg/bundler/bundler.go now applies bundle-time enable and disable overrides, prunes dependency edges for removed components, and rejects re-enabling components disabled in the recipe. Tests and documentation were updated to match the new behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The topology fix matches the issue, but the PR breaks #1464's override-precedence requirement by rejecting --set <component>:enabled=true for recipe-disabled components. Restore bundle-time re-enable support for recipe-disabled components, or revise the linked issue if that contract is changing.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately captures the main recipe-ordering change around enabled:false.
Out of Scope Changes check ✅ Passed The rest of the changes stay within the enabled:false ordering fix and its docs/tests; no unrelated edits stand out.
Description check ✅ Passed The description matches the code changes by explaining disabled-component handling, bundle behavior, and the new re-enable restriction.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@yuanchen8911 yuanchen8911 changed the title fix(recipe): honor enabled:false in deployment ordering fix fix(recipe): fix deployment ordering to honor enabled:false Jun 24, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/recipe-topo-honor-enabled branch from 448b588 to 993cc7d Compare June 24, 2026 22:31
@github-actions

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/recipe/metadata_test.go`:
- Around line 191-215: Add the missing acceptance regression tests in
RecipeMetadataSpec/DeploymentOrder coverage to pin the new TopologicalSort
behavior. One case should verify an undeclared dependency still fails instead of
being treated as externally satisfied, and another should verify a later
override like --set <component>:enabled=true restores a default-disabled
ComponentRef into the computed order. Use the existing test structure around
TopologicalSort, edgeSatisfiedExternally, and DeploymentOrder so these
boundaries stay covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b60d9fc7-fe46-46ed-b802-3c97fa1ccc2d

📥 Commits

Reviewing files that changed from the base of the PR and between 448b588 and 993cc7d.

📒 Files selected for processing (5)
  • docs/contributor/recipe.md
  • docs/integrator/recipe-development.md
  • docs/user/bundling.md
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_test.go

Comment thread pkg/recipe/metadata_test.go
@yuanchen8911 yuanchen8911 force-pushed the fix/recipe-topo-honor-enabled branch from 993cc7d to 70cf007 Compare June 24, 2026 22:44
@yuanchen8911 yuanchen8911 force-pushed the fix/recipe-topo-honor-enabled branch from 70cf007 to 4d788a4 Compare June 24, 2026 22:45

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/bundler.go`:
- Around line 273-278: The dependency pruning in bundler logic is too aggressive
because it removes every dep not present in enabledSet, which also strips
undeclared dependencies before validation can catch them. Update the pruning
flow in the bundler function that builds pruned deps so it only excludes
dependencies that are explicitly declared but disabled, and preserves undeclared
entries so the existing validation path still fails as expected. Use the
enabledSet check in the same loop, but ensure undeclared dependencies are not
filtered out by that step.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: cb68001a-70d0-46c9-aa3f-7700ccc2d701

📥 Commits

Reviewing files that changed from the base of the PR and between 993cc7d and 70cf007.

📒 Files selected for processing (8)
  • docs/contributor/recipe.md
  • docs/integrator/recipe-development.md
  • docs/user/bundling.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler.go
  • pkg/bundler/bundler_test.go
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_test.go

Comment thread pkg/bundler/bundler.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/bundler_test.go`:
- Around line 360-366: The new rejection path in bundler tests should assert the
specific structured error code instead of only expecting any error. Update the
relevant Make() test case in bundler_test.go, using the existing error handling
around the recipe disabled plus setEnabled scenario, to verify
ErrCodeInvalidRequest/INVALID_REQUEST; apply the same stronger assertion to the
invalid-value case below if it uses the same path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: aa3fbaca-98a7-4011-a9b5-b2c4901f3b69

📥 Commits

Reviewing files that changed from the base of the PR and between 70cf007 and 4d788a4.

📒 Files selected for processing (8)
  • docs/contributor/recipe.md
  • docs/integrator/recipe-development.md
  • docs/user/bundling.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler.go
  • pkg/bundler/bundler_test.go
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_test.go

Comment thread pkg/bundler/bundler_test.go
@yuanchen8911 yuanchen8911 force-pushed the fix/recipe-topo-honor-enabled branch 3 times, most recently from f99a043 to 2b9fd4b Compare June 24, 2026 23:01

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/integrator/recipe-development.md`:
- Around line 297-302: The docs currently refer to a top-level enabled field on
recipe.ComponentRef, but the resolved recipe actually stores this state under
overrides.enabled. Update the recipe-development guidance to use the real
serialized path and explain that a disabled component is represented by
overrides.enabled: false, while keeping the note about bundle-time re-enable
rejection and removal of the override to ship the component.

In `@pkg/bundler/bundler_test.go`:
- Around line 353-357: Update TestMake_DisabledDependencyPruned to exercise the
helmfile deployer instead of the default New() path, since the pruning behavior
is only validated when dependency levels are recomputed from DependencyRefs.
Locate the setup in TestMake_DisabledDependencyPruned and construct the bundler
through the same helmfile-specific path used by
TestMake_UndeclaredDependencyErrors so the regression actually covers the
pruning logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 76d41550-a95a-4bb4-a574-b9319e8f77e5

📥 Commits

Reviewing files that changed from the base of the PR and between 4d788a4 and fc370a0.

📒 Files selected for processing (8)
  • docs/contributor/recipe.md
  • docs/integrator/recipe-development.md
  • docs/user/bundling.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler.go
  • pkg/bundler/bundler_test.go
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_test.go

Comment thread docs/integrator/recipe-development.md
Comment thread pkg/bundler/bundler_test.go
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 24, 2026 23:05
The recipe and bundle layers invoke different topological-sort helpers, and
neither honored a component disabled via overrides:{enabled:false}:

- TopologicalSort (recipe resolution) and ComponentRefsTopologicalLevels
  (helmfile bundle generation) iterated over every ComponentRef with no
  IsEnabled() check, so a disabled component (e.g. cert-manager on OKE, where
  the CSP provides its own) still appeared in deploymentOrder.
- The bundler removes disabled components from ComponentRefs but left their
  dependents' DependencyRefs dangling, so the helmfile deployer's
  ComponentRefsTopologicalLevels call saw an undeclared edge and failed with a
  false circular-dependency error.

Recipe layer: both ordering helpers now skip disabled components and treat an
edge to a declared-but-disabled component as satisfied externally; edges to
undeclared components still error, preserving missing-dependency detection.

Bundler: prune DependencyRefs that point at filtered-out components before the
deployer recomputes levels, so the disable path works for the helmfile deployer
too. DeploymentOrder continues to be filtered from the recipe's authored order
(not recomputed), preserving existing ordering.

Re-enable is disallowed: --set <component>:enabled=true on a component the
recipe disabled is now rejected with ErrCodeInvalidRequest. The author disables
a component because the platform already provides it, so re-enabling would
install a conflicting second copy and has no authored deployment order.

Docs: recipe development guide, bundling guide, CLI reference, and contributor
recipe internals updated for disabling components and the re-enable restriction.

Fixes NVIDIA#1464
@yuanchen8911 yuanchen8911 force-pushed the fix/recipe-topo-honor-enabled branch from 2b9fd4b to b2e6b67 Compare June 24, 2026 23:10
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

Follow-up refactor tracked in #1466 (unify the two topological-sort paths into one implementation). Out of scope for this bug fix.

@mchmarny mchmarny enabled auto-merge (squash) June 25, 2026 10:19
@mchmarny mchmarny merged commit 9cd3fb7 into NVIDIA:main Jun 25, 2026
33 checks passed
atif1996 added a commit that referenced this pull request Jun 26, 2026
OKE pre-installs cert-manager as a managed addon. Use the new
overrides.enabled: false mechanism (PR #1465) to disable the
AICR-managed cert-manager on OKE without restructuring base.yaml.

Update TestAllComponentTypesValid to skip type validation for disabled
components — a disabled componentRef has no type because it is provided
externally, not managed by the bundler.

Update the L40S OKE UAT assert-recipe fixture: cert-manager is still
present in componentRefs (with enabled: false) but is excluded from
deploymentOrder since the bundler skips disabled components.
atif1996 added a commit that referenced this pull request Jun 28, 2026
OKE pre-installs cert-manager as a managed addon. Use the new
overrides.enabled: false mechanism (PR #1465) to disable the
AICR-managed cert-manager on OKE without restructuring base.yaml.

Update TestAllComponentTypesValid to skip type validation for disabled
components — a disabled componentRef has no type because it is provided
externally, not managed by the bundler.

Update the L40S OKE UAT assert-recipe fixture: cert-manager is still
present in componentRefs (with enabled: false) but is excluded from
deploymentOrder since the bundler skips disabled components.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Recipe topo sort does not honor enabled:false (overlay component disable broken)

2 participants