fix(recipe): fix deployment ordering to honor enabled:false#1465
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthrough
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
448b588 to
993cc7d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/recipe/metadata_test.go`:
- Around line 191-215: Add the missing acceptance regression tests in
RecipeMetadataSpec/DeploymentOrder coverage to pin the new TopologicalSort
behavior. One case should verify an undeclared dependency still fails instead of
being treated as externally satisfied, and another should verify a later
override like --set <component>:enabled=true restores a default-disabled
ComponentRef into the computed order. Use the existing test structure around
TopologicalSort, edgeSatisfiedExternally, and DeploymentOrder so these
boundaries stay covered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b60d9fc7-fe46-46ed-b802-3c97fa1ccc2d
📒 Files selected for processing (5)
docs/contributor/recipe.mddocs/integrator/recipe-development.mddocs/user/bundling.mdpkg/recipe/metadata.gopkg/recipe/metadata_test.go
993cc7d to
70cf007
Compare
70cf007 to
4d788a4
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/bundler.go`:
- Around line 273-278: The dependency pruning in bundler logic is too aggressive
because it removes every dep not present in enabledSet, which also strips
undeclared dependencies before validation can catch them. Update the pruning
flow in the bundler function that builds pruned deps so it only excludes
dependencies that are explicitly declared but disabled, and preserves undeclared
entries so the existing validation path still fails as expected. Use the
enabledSet check in the same loop, but ensure undeclared dependencies are not
filtered out by that step.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: cb68001a-70d0-46c9-aa3f-7700ccc2d701
📒 Files selected for processing (8)
docs/contributor/recipe.mddocs/integrator/recipe-development.mddocs/user/bundling.mddocs/user/cli-reference.mdpkg/bundler/bundler.gopkg/bundler/bundler_test.gopkg/recipe/metadata.gopkg/recipe/metadata_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/bundler_test.go`:
- Around line 360-366: The new rejection path in bundler tests should assert the
specific structured error code instead of only expecting any error. Update the
relevant Make() test case in bundler_test.go, using the existing error handling
around the recipe disabled plus setEnabled scenario, to verify
ErrCodeInvalidRequest/INVALID_REQUEST; apply the same stronger assertion to the
invalid-value case below if it uses the same path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: aa3fbaca-98a7-4011-a9b5-b2c4901f3b69
📒 Files selected for processing (8)
docs/contributor/recipe.mddocs/integrator/recipe-development.mddocs/user/bundling.mddocs/user/cli-reference.mdpkg/bundler/bundler.gopkg/bundler/bundler_test.gopkg/recipe/metadata.gopkg/recipe/metadata_test.go
f99a043 to
2b9fd4b
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/integrator/recipe-development.md`:
- Around line 297-302: The docs currently refer to a top-level enabled field on
recipe.ComponentRef, but the resolved recipe actually stores this state under
overrides.enabled. Update the recipe-development guidance to use the real
serialized path and explain that a disabled component is represented by
overrides.enabled: false, while keeping the note about bundle-time re-enable
rejection and removal of the override to ship the component.
In `@pkg/bundler/bundler_test.go`:
- Around line 353-357: Update TestMake_DisabledDependencyPruned to exercise the
helmfile deployer instead of the default New() path, since the pruning behavior
is only validated when dependency levels are recomputed from DependencyRefs.
Locate the setup in TestMake_DisabledDependencyPruned and construct the bundler
through the same helmfile-specific path used by
TestMake_UndeclaredDependencyErrors so the regression actually covers the
pruning logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 76d41550-a95a-4bb4-a574-b9319e8f77e5
📒 Files selected for processing (8)
docs/contributor/recipe.mddocs/integrator/recipe-development.mddocs/user/bundling.mddocs/user/cli-reference.mdpkg/bundler/bundler.gopkg/bundler/bundler_test.gopkg/recipe/metadata.gopkg/recipe/metadata_test.go
The recipe and bundle layers invoke different topological-sort helpers, and
neither honored a component disabled via overrides:{enabled:false}:
- TopologicalSort (recipe resolution) and ComponentRefsTopologicalLevels
(helmfile bundle generation) iterated over every ComponentRef with no
IsEnabled() check, so a disabled component (e.g. cert-manager on OKE, where
the CSP provides its own) still appeared in deploymentOrder.
- The bundler removes disabled components from ComponentRefs but left their
dependents' DependencyRefs dangling, so the helmfile deployer's
ComponentRefsTopologicalLevels call saw an undeclared edge and failed with a
false circular-dependency error.
Recipe layer: both ordering helpers now skip disabled components and treat an
edge to a declared-but-disabled component as satisfied externally; edges to
undeclared components still error, preserving missing-dependency detection.
Bundler: prune DependencyRefs that point at filtered-out components before the
deployer recomputes levels, so the disable path works for the helmfile deployer
too. DeploymentOrder continues to be filtered from the recipe's authored order
(not recomputed), preserving existing ordering.
Re-enable is disallowed: --set <component>:enabled=true on a component the
recipe disabled is now rejected with ErrCodeInvalidRequest. The author disables
a component because the platform already provides it, so re-enabling would
install a conflicting second copy and has no authored deployment order.
Docs: recipe development guide, bundling guide, CLI reference, and contributor
recipe internals updated for disabling components and the re-enable restriction.
Fixes NVIDIA#1464
2b9fd4b to
b2e6b67
Compare
|
Follow-up refactor tracked in #1466 (unify the two topological-sort paths into one implementation). Out of scope for this bug fix. |
OKE pre-installs cert-manager as a managed addon. Use the new overrides.enabled: false mechanism (PR #1465) to disable the AICR-managed cert-manager on OKE without restructuring base.yaml. Update TestAllComponentTypesValid to skip type validation for disabled components — a disabled componentRef has no type because it is provided externally, not managed by the bundler. Update the L40S OKE UAT assert-recipe fixture: cert-manager is still present in componentRefs (with enabled: false) but is excluded from deploymentOrder since the bundler skips disabled components.
OKE pre-installs cert-manager as a managed addon. Use the new overrides.enabled: false mechanism (PR #1465) to disable the AICR-managed cert-manager on OKE without restructuring base.yaml. Update TestAllComponentTypesValid to skip type validation for disabled components — a disabled componentRef has no type because it is provided externally, not managed by the bundler. Update the L40S OKE UAT assert-recipe fixture: cert-manager is still present in componentRefs (with enabled: false) but is excluded from deploymentOrder since the bundler skips disabled components.
Summary
Honor
overrides: {enabled: false}end-to-end so a component disabled in a recipe/overlay (e.g. cert-manager on OKE, where the CSP provides its own) is excluded from both the recipe'sdeploymentOrderand the generated bundle — across all deployers, including helmfile. Re-enabling a recipe-disabled component at bundle time is explicitly disallowed.Notes for Reviewers
--set <component>:enabled=truenow errors). It would complicate the fix, and — more importantly — re-enabling at bundle time creates drift between the recipe and what is actually deployed. I'm not convinced we should allow it at all; open to discussion.Motivation / Context
The recipe and bundle layers invoke different topological-sort helpers, and neither honored a disabled component:
TopologicalSort()(recipe resolution) andComponentRefsTopologicalLevels()(helmfile bundle generation) iterated over everyComponentRefwith noIsEnabled()check, so a disabled component still appeared indeploymentOrder.ComponentRefsbut left dependents'DependencyRefsdangling, so the helmfile deployer'sComponentRefsTopologicalLevelscall saw an undeclared edge and failed with a false circular-dependency error.This blocked the clean way to drop the AICR-bundled cert-manager for OKE (OKE infra provides its own; two cert-managers on one cluster conflict), part of the P0 vanilla-OKE recipe work.
Fixes: #1464
Related: N/A
Type of Change
Component(s) Affected
pkg/recipe)pkg/bundler)docs/)Implementation Notes
Recipe layer (
pkg/recipe/metadata.go).TopologicalSort()andComponentRefsTopologicalLevels()now skip disabled components and treat an edge pointing at a declared-but-disabled component as already satisfied (assumed provided externally), so dependents such as gpu-operator order correctly instead of tripping the cycle guard. Edges to undeclared components are still counted, preserving missing-dependency detection (guarded by an existing test). Shared logic factored intocomponentSets()/edgeSatisfiedExternally().Bundler (
pkg/bundler/bundler.go). After filtering disabled components out ofComponentRefs, prune dependents'DependencyRefsthat point at the removed components, so the helmfile deployer's level computation no longer sees a dangling edge.DeploymentOrdercontinues to be filtered from the recipe's authored order (not recomputed), preserving existing ordering.Re-enable disallowed.
--set <component>:enabled=trueon a component the recipe disabled is now rejected withErrCodeInvalidRequest. The author disables a component because the platform already provides it, so re-enabling would install a conflicting second copy and has no authored deployment order.--set <component>:enabled=falsestill disables a component the recipe leaves on.Testing
go test ./pkg/recipe/... ./pkg/bundler/... golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./pkg/bundler/...pkg/recipeandpkg/bundlersuites pass; lint reports 0 issues.--set enabledprecedence test now asserts re-enabling a recipe-disabled component errors.pkg/bundler/attestationfailure (TestFetchAmbientOIDCToken) reproduces on cleanorigin/main— environmental (ambient OIDC), outside this change.Manual end-to-end verification
Built
aicrfrom this branch and temporarily disabled cert-manager (has dependents gpu-operator + nvsentinel) and k8s-ephemeral-storage-metrics (a leaf) in theh100-eks-ubuntu-trainingoverlay viaoverrides: {enabled: false}. The overlay edit was for verification only and is not part of this PR.aicr recipe—deploymentOrder14 → 12, both disabled components gone; gpu-operator + nvsentinel still ordered correctly (no false cycle); both disabled components remain incomponentRefswithenabled: false.aicr bundle(default helm deployer) — exactly 12 numbered component dirs; no disabled component anywhere.aicr bundle --deployer helmfile(the previously-broken path) — succeeds, 12 dirs, no dangling cert-manager edge / false circular-dependency error.aicr bundle --set certmanager:enabled=true— rejected with[INVALID_REQUEST] component "cert-manager" is disabled by the recipe and cannot be re-enabled.Risk Assessment
deploymentOrderstill filtered from the authored order; bundler pruning is a no-op without dangling edges). Additive tests; easy to revert.Rollout notes: N/A — backwards compatible. The only behavior change for existing users is that
--set <component>:enabled=trueon a recipe-disabled component now errors instead of silently re-including it (previously unsupported in practice — the re-enabled component was mis-ordered).Checklist
make testwith-race)make lint)git commit -S)