fix(bundler): stratify helmfile bundle by DAG level (#914)#926
Conversation
Replaces #922's 2-way crds.yaml/releases.yaml split with an N-way partition keyed on the recipe dependency graph. The bundler now computes topological levels via recipe.TopologicalLevels (Kahn, drain-current-generation variant) and emits one sub-helmfile per non-empty level: level-0.yaml, level-1.yaml, .... The top-level helmfile.yaml carries `helmfiles: [level-0.yaml, ...]`; helmfile processes the list sequentially, so by the time level K diffs every release in levels 0..K-1 has fully applied and the cluster's REST mapper is warm for every CRD they install. This fixes the helm-diff CRD-cycle race across the full bundle without requiring a registry flag per CRD-installer chart. Cross- chart races are now handled by the DAG itself — `dependencyRefs` already encodes the relationship. Why not just keep installsCRDs: The 2-way split needed the flag to decide placement, and it papered over the fact that some charts shipping CRDs are also CRD-consumers of sibling charts (nvsentinel consumes prometheus-operator-crds). The stratified design has no such tier confusion; flag removed. What still needs a registry mark: hasSelfRefCRDs: true. Some charts contain a CRD AND a CR of that kind in the same release (gpu-operator's ClusterPolicy template, kai-scheduler's default-queue, kubeflow-trainer's ClusterTrainingRuntime runtimes). helm-diff renders both and validates the CR before helm has applied the CRD — no inter-release ordering can fix that. The new HasSelfRefCRDs registry field emits helmfile per-release `disableValidation: true` for these. Related changes: - Split kube-prometheus-stack: new prometheus-operator-crds component (chart 28.0.1, matches operator v0.90.1 pinned by kube-prometheus-stack 84.4.0). kube-prometheus-stack now runs with crds.enabled: false and depends on prometheus-operator-crds via DependencyRefs. - nvsentinel installsCRDs removed (nothing else in the bundle creates HealthCheck/NodeHealth CRs; its PodMonitor consumer relationship is satisfied transitively). - New TopologicalLevels function in pkg/recipe/metadata.go with 14 unit-test cases (cycle detection, missing-dep, diamond, wide root, realistic AICR-shape, alphabetical tie-break, cross-check against TopologicalSort). - splitFoldersByCRD → splitFoldersByLevel (N-way partition). - HelmDefaults/Release struct: DisableValidation per-release; server-side apply NOT emitted (helmfile v1 removed per-release `args:`, and helmDefaults.args leaks into helm-diff which doesn't accept --server-side without HELM_DIFF_IGNORE_UNKNOWN_FLAGS). - Bundle goldens regenerated (testdata/upstream_helm_only/); chainsaw bundle-helmfile + ai-conformance offline test updated for the level-N.yaml file pattern; UAT + chainsaw cli recipe assertions updated for the new prometheus-operator-crds entry. - BOM regenerated; component-catalog, api-reference, design/006 image-pinning, ai-conformance README updated. Known follow-up (NOT in this PR): - TopologicalSort and TopologicalLevels produce different flat orderings due to Kahn tie-break vs. level-drain. Folder numbering (001-N, driven by TopologicalSort) doesn't strictly match install order (driven by levels). Cosmetic; bundle deploys correctly. Will be unified in a follow-up. Fixes: #914
📝 WalkthroughWalkthroughThis PR refactors the helmfile bundle generation to resolve fresh-cluster CRD/CR ordering failures (issue Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml`:
- Around line 93-95: Two inline comments that assert the component count still
read "15" even though prometheus-operator-crds was added; update both
occurrences of the "15" component-count comments to "16" so the comment matches
the new expectation (search for the component-count/comment strings that
reference "15" near the list of components including "prometheus-operator-crds"
and replace them with "16").
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 03336542-6503-4fa7-a170-84c235a0c906
📒 Files selected for processing (36)
demos/cuj2-demo.mddocs/design/006-image-pinning-policy.mddocs/user/api-reference.mddocs/user/component-catalog.mddocs/user/container-images.mdexamples/recipes/aks-training.yamlexamples/recipes/eks-gb200-ubuntu-training-with-validation.yamlexamples/recipes/eks-training.yamlexamples/recipes/kind.yamlpkg/bundler/deployer/helmfile/crd_split_test.gopkg/bundler/deployer/helmfile/helmfile.gopkg/bundler/deployer/helmfile/helmfile_test.gopkg/bundler/deployer/helmfile/releases.gopkg/bundler/deployer/helmfile/testdata/kai_scheduler_async/helmfile.yamlpkg/bundler/deployer/helmfile/testdata/mixed_gpu_operator/helmfile.yamlpkg/bundler/deployer/helmfile/testdata/upstream_helm_only/helmfile.yamlpkg/bundler/deployer/helmfile/testdata/upstream_helm_only/level-0.yamlpkg/bundler/deployer/helmfile/testdata/upstream_helm_only/level-1.yamlpkg/recipe/components.gopkg/recipe/metadata.gopkg/recipe/metadata_test.gorecipes/components/kube-prometheus-stack/values.yamlrecipes/components/prometheus-operator-crds/values.yamlrecipes/overlays/base.yamlrecipes/registry.yamltests/chainsaw/ai-conformance/README.mdtests/chainsaw/ai-conformance/offline/assert-recipe.yamltests/chainsaw/ai-conformance/offline/chainsaw-test.yamltests/chainsaw/cli/bundle-helmfile/chainsaw-test.yamltests/chainsaw/cli/cuj1-training/assert-recipe.yamltests/uat/aws/tests/cuj1-training/assert-recipe.yamltests/uat/aws/tests/cuj2-inference/assert-recipe.yamltests/uat/azure/tests/cuj1-training/assert-recipe.yamltests/uat/azure/tests/cuj2-inference/assert-recipe.yamltests/uat/gcp/tests/cuj1-training/assert-recipe.yamltests/uat/gcp/tests/cuj2-inference/assert-recipe.yaml
| agentgateway agentgateway-crds \ | ||
| prometheus-operator-crds kube-prometheus-stack \ | ||
| nvidia-dra-driver-gpu nvsentinel prometheus-adapter \ |
There was a problem hiding this comment.
Update stale component-count comments to match the new expectation.
After adding prometheus-operator-crds, the file now validates 16 components/directories, but Line 24 and Line 84 still say "15". Please update both comments to avoid confusion during test maintenance.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml` around lines 93 -
95, Two inline comments that assert the component count still read "15" even
though prometheus-operator-crds was added; update both occurrences of the "15"
component-count comments to "16" so the comment matches the new expectation
(search for the component-count/comment strings that reference "15" near the
list of components including "prometheus-operator-crds" and replace them with
"16").
Coverage Report ✅
Coverage BadgeMerging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
mchmarny
left a comment
There was a problem hiding this comment.
The DAG-stratification thesis is right: removing the hand-marked installsCRDs flag and letting dependencyRefs drive ordering is structurally simpler and handles the cross-chart cases that #922's 2-way split couldn't. TopologicalLevels (with the level-by-level Kahn batching) is well-tested, and the chainsaw test was updated to walk the new layout. Per-release disableValidation: true is the right call for self-reference cases like gpu-operator and kai-scheduler.
Findings:
- Medium — stale bundle README. The template at
pkg/bundler/deployer/helmfile/templates/README.md.tmpllines 26-31 still referencescrds.yaml/releases.yaml. Every multi-level bundle ships an inaccurate operator-facing README. TheHasCRDLayerfield that gates it also needs renaming. Details inline athelmfile.go:168. - Medium — brittle transitive chain for monitoring CRDs. nvsentinel relies on a three-hop dep path to prometheus-operator-crds; a future refactor of an intermediate node silently re-introduces the helm-diff race. Worth a direct edge or at least a TODO. Details inline at
recipes/registry.yaml. - Low —
releases.go:216cascadesdisableValidation: trueto injected-pre/-postfolders of a self-ref-CRDs parent (visible inmixed_gpu_operatorgolden). Probably intentional but the inline rationale doesn't explain why. - Nit — duplicated comment line in
releases.go:209-210. Stalecrds.yaml/releases.yamltext also lingers inhelmfile.go:363-365andhelmfile_test.go:753,790(those lines are outside the diff hunks, mentioning here for the sweep).
Nothing blocks merge; the README fix is worth catching before this lands since it ships into every bundle. CI Tier 2 jobs were still running at review time.
| // Emit either the single-file helmfile.yaml or the stratified | ||
| // helmfiles: layout depending on whether the dependency DAG | ||
| // produces more than one level (issue #914). | ||
| splitLayout, err := g.writeHelmfileLayout(outputDir, output, writeResult.Folders, sortedRefs, namespaceByComponent) |
There was a problem hiding this comment.
Medium — stale user-facing README. The splitLayout bool flows into readmeData.HasCRDLayer and gates a block in templates/README.md.tmpl (lines 26-31) that still tells operators to look for crds.yaml and releases.yaml:
For bundles that include CRD-owning charts,
`helmfile.yaml` is a thin orchestration document with a `helmfiles:` list:
`crds.yaml` is processed first to register CRDs, then `releases.yaml`
installs every chart that consumes them.
Those files no longer exist in this layout — the sub-helmfiles are now level-0.yaml … level-N.yaml. As-is, anyone reading the bundle's own README will be looking for files the bundler doesn't emit. The first sentence ("For bundles that include CRD-owning charts…") is also no longer accurate — any recipe whose DAG produces more than one level will render this block.
Action items:
- Rewrite the template block to describe the stratified layout (one sub-helmfile per dependency level, processed in order).
- Rename
readmeData.HasCRDLayer(and the relatedsplitLayoutlocal) to something likeHasSubHelmfiles/stratified— the current name claims something the field no longer means. - Regenerate the golden at
pkg/bundler/deployer/helmfile/testdata/upstream_helm_only/README.md(lines 25-30 carry the same stale text) viago test -update ./pkg/bundler/deployer/helmfile/....
| // Registry-driven per-release flags. HasSelfRefCRDs emits | ||
| // Registry-driven per-release flags. HasSelfRefCRDs emits |
There was a problem hiding this comment.
nit: this comment line is duplicated — // Registry-driven per-release flags. HasSelfRefCRDs emits appears twice. Drop one.
| // they ship in `crds/` (e.g., gpu-operator's clusterpolicy | ||
| // template). Keyed by f.Parent so injected -pre / -post | ||
| // folders inherit the same flag. | ||
| if flags[f.Parent].HasSelfRefCRDs { |
There was a problem hiding this comment.
Inheriting disableValidation: true to every f.Parent == gpu-operator folder (primary + injected -pre/-post) is over-broad and shows up in goldens like testdata/mixed_gpu_operator/helmfile.yaml where gpu-operator-post (just a RuntimeClass manifest) ends up with disableValidation: true. The post-folder is a local-helm wrapper around arbitrary manifests — it doesn't ship the chart's crds/ and doesn't self-reference any CRD, so there's no reason to disable validation on it.
The trade-off is small (only the post-folder loses helm-diff's mapper sanity check), so this is fine to defer if it simplifies the code, but worth a comment acknowledging the choice. Today the inline comment says "so injected -pre / -post folders inherit the same flag" without explaining why we want that inheritance — it reads as accidental coupling rather than intentional.
| # nvsentinel's templates create monitoring.coreos.com/v1 PodMonitor | ||
| # CRs that prometheus-operator-crds must register first. Ordering | ||
| # is satisfied transitively in the DAG-stratified helmfile layout: | ||
| # nvsentinel depends on gpu-operator, which depends on | ||
| # kube-prometheus-stack, which depends on prometheus-operator-crds, | ||
| # so by the time nvsentinel's level diffs, the monitoring CRDs are | ||
| # already registered. Issue #914. |
There was a problem hiding this comment.
Medium — brittle transitive chain. This ordering relies on a three-hop transitive path: nvsentinel → gpu-operator → kube-prometheus-stack → prometheus-operator-crds. If anyone removes the gpu-operator → kube-prometheus-stack edge in recipes/overlays/base.yaml (it's plausible — gpu-operator doesn't strictly need kube-prometheus-stack to install, only to scrape metrics), nvsentinel's helm-diff race against PodMonitor CRD silently returns and there's no test that catches it.
The whole PR's thesis is "let dependencyRefs encode the ordering" — applying that consistently means nvsentinel should declare a direct dependencyRefs: [prometheus-operator-crds] (in addition to its existing edges). Same applies to any other component whose templates create monitoring.coreos.com CRs. The DAG-flattening makes the extra edge a no-op when the transitive path already exists, but it survives refactors of the intermediate nodes.
Worth doing in this PR while the rationale is fresh; otherwise drop a "TODO: harden direct edges for monitoring-CR creators" so the gap is documented.
Summary
Replace #922's 2-way
crds.yaml/releases.yamlsplit with an N-way DAG-stratifiedlayout: one sub-helmfile per topological level (
level-0.yaml…level-N.yaml).Cross-chart CRD ordering is now handled by
dependencyRefsalone; theinstallsCRDsflag is removed. Self-reference cases (CRD + CR in the samerelease) get a per-release
disableValidation: truedriven by a newhasSelfRefCRDsregistry field.Motivation / Context
Issue #914 is the helm-diff "CRD not yet registered" race that fires on a fresh
cluster when a chart's templates reference CRDs installed by a sibling (or by
the same chart). #922 fixed the cross-chart case for a hand-marked subset of
charts via a 2-way sub-helmfile split. In live testing on kind we discovered
several gaps:
installsCRDs: true(no in-bundle consumerof HealthCheck / NodeHealth CRs), causing it to land in
crds.yamland raceagainst prometheus-operator-crds' diff in the same tier.
NodeFeatureRuleCR depends on nfd's CRD, which wasn'tmarked.
their own
crds/directory from their own templates — a failure mode the2-way split can't solve regardless of which tier they land in.
templates/, so its owntemplates render against an unregistered REST mapper even on fresh installs.
The fix is to stop hand-marking charts and let the DAG do the ordering. A
chart that produces CRDs and a chart that consumes them are already related
via
dependencyRefs; the bundler now reads that graph directly.Fixes: #914
Related: #922
Type of Change
Component(s) Affected
pkg/recipe)pkg/bundler,pkg/component/*)docs/,examples/)Implementation Notes
recipe.TopologicalLevels(new inpkg/recipe/metadata.go): Kahn variantthat drains the entire current zero-in-degree generation as one batch before
processing its dependents. Returns
[][]stringwhere every component in leveliis mutually independent (no edges among them) and depends only oncomponents in levels
<i. 14 unit-test cases cover cycle detection(self-loop, two- and three-node), missing-dep surfacing as cycle, diamond,
wide root, the realistic AICR-shape, alphabetical tie-break within a level,
and a cross-check against
TopologicalSortthat flattened levels respectevery
dependencyRefsedge.Helmfile bundler refactor (
pkg/bundler/deployer/helmfile/):splitFoldersByCRD→splitFoldersByLevel. N-way partition by parentcomponent's DAG level. Folders with unknown parents default to level 0
(defensive).
writeHelmfileLayoutemits one sub-helmfile per non-empty level(
level-0.yaml,level-1.yaml, ...) and a top-levelhelmfile.yamlcarrying
helmfiles: [level-0.yaml, ...]. Single-level bundles collapseto a single
helmfile.yaml(no sub-helmfile overhead).needs:edges dissolve because helmfile processes thehelmfiles:list sequentially. Within a level there's still aneeds:chain for deterministic install order.
Registry schema:
hasSelfRefCRDs: boolfield. Drives per-releasedisableValidation: truein helmfile. Marked components:
gpu-operator(ClusterPolicy template +CRD),
kai-scheduler(default-queue / default-shard),kubeflow-trainer(default ClusterTrainingRuntimes).
installsCRDsfield removed. The DAG-level partition makes it redundantfor placement, and helmfile v1 doesn't accept
args:per release (so wecan't piggyback SSA on it). Removed from
ComponentConfig, every registryentry, and
TestRegistry_CRDOwnersStillMarked.kube-prometheus-stack split:
prometheus-operator-crdscomponent (chart28.0.1, matches operatorv0.90.1pinned by kube-prometheus-stack84.4.0).kube-prometheus-stackships withcrds.enabled: falseand declaresdependencyRefs: [prometheus-operator-crds], so the operator chart landsin a later level than its CRDs.
Server-side apply NOT emitted: helmfile v1 removed per-release
args:,and
helmDefaults.argspasses flags to helm-diff too (which doesn't accept--server-side). The supported escape hatch isHELM_DIFF_IGNORE_UNKNOWN_FLAGS=trueas an env var — explicitly rejected because the bundle should be self-contained.
If the annotation-too-large failure surfaces in practice, follow-up will mark
specific releases with another registry flag.
Known follow-up (NOT in this PR):
TopologicalSort(Kahn with per-dequeue tie-break) andTopologicalLevels(Kahn with generation-drain) produce different flat orderings. Folder
numbering (001-N, driven by
TopologicalSort) doesn't strictly matchinstall order (driven by levels). Cosmetic — bundle deploys correctly
either way. Unifying is a separate PR because it touches every chainsaw
Testing
Live tested against a fresh kind cluster:
PodMonitor race, network-operator NodeFeatureRule race
likewise
Per-package coverage on the affected paths:
level-N.yaml layout
Risk Assessment
generated bundle layout. The DAG-based partition is a strictly more general
algorithm than the previous flag-based one; existing recipes produce
correct (and structurally simpler) bundles. The schema change is removing
a field, not adding one — recipes that referenced installsCRDs: true
in YAML overlays would be silently ignored on the new schema, so a quick
grep across consumers before merge is worthwhile.
Rollout notes:
level-N.yaml layout instead of crds.yaml + releases.yaml. Both layouts
are valid helmfile.yaml inputs; existing helmfile apply / helmfile diff invocations against the bundle directory work without changes.
generated bundle needs to be updated to level-*.yaml. (None known
in-tree; chainsaw bundle-helmfile test updated.)
Checklist