Skip to content

fix(bundler): split helmfile bundle into CRD + main sub-helmfiles#922

Merged
mchmarny merged 2 commits into
mainfrom
fix/914-helmfile-crd-split
May 15, 2026
Merged

fix(bundler): split helmfile bundle into CRD + main sub-helmfiles#922
mchmarny merged 2 commits into
mainfrom
fix/914-helmfile-crd-split

Conversation

@mchmarny

Copy link
Copy Markdown
Member

Summary

helm-diff renders every release against the live cluster's REST mapper before any release is installed. On a fresh cluster this fails for charts whose templates reference a CRD installed by a different chart in the same bundle (e.g., nodewright-customizations uses kind: Skyhook, registered by nodewright-operator). helmfile's needs: directive sequences installs but not the diff pre-pass, so the document aborts before any mutation with no matches for kind "Skyhook" in version "skyhook.nvidia.com/v1alpha1".

This PR implements option C1 from the issue: when any referenced component is registry-marked installsCRDs, the helmfile deployer emits a multi-helmfiles: document so CRDs land before consumers render.

Fixes: #914
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Documentation update (demo workaround removed; generated README explains the layout)

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)
  • Recipe engine / data (pkg/recipe)
  • Docs/examples (docs/, examples/)

Implementation Notes

Layout when split is active:

bundle/
  helmfile.yaml          # helmfiles: [crds.yaml, releases.yaml]
  crds.yaml              # CRD-owner releases, with their own repositories + needs chain
  releases.yaml          # every other release, separate repositories + needs chain
  001-cert-manager/      # per-component folders unchanged
  002-…/

helmfile processes the helmfiles: list sequentially, so by the time releases.yaml renders, every CRD installed by crds.yaml is registered with the cluster's REST mapper. The cross-sub-helmfile needs: edge (e.g., gpu-operator → cert-manager) dissolves because sub-helmfile sequencing already enforces ordering.

Decision tree in Generator.writeHelmfileLayout:

  • No CRD-owner components → single helmfile.yaml (legacy path, byte-identical for non-mixed bundles).
  • Every component is a CRD-owner → single helmfile.yaml (split would add no value).
  • Mixed → emit the three-file split.

Registry marks (installsCRDs: true) added to:

  • cert-manager — Certificate / Issuer / ClusterIssuer / Order / Challenge / CertificateRequest.
  • nodewright-operator — Skyhook (the originating failure).
  • kube-prometheus-stack — ServiceMonitor / PodMonitor / Alertmanager / Prometheus / etc.
  • nvsentinel — HealthCheck / NodeHealth.
  • agentgateway-crds — CRDs-only chart.
  • slinky-slurm-operator-crds — CRDs-only chart.

Other components (gpu-operator, network-operator, nvidia-dra-driver-gpu, kueue, kubeflow-trainer, etc.) also install CRDs but are not consumed by other charts in the bundle — leaving them unmarked keeps the main releases.yaml chain longer and the split surface smaller. Easy to opt in if a future consumer surfaces the same bug.

Tests added:

  • pkg/bundler/deployer/helmfile/crd_split_test.goTestSplitFoldersByCRD (partition shape, pre/post inheritance), TestSplitFoldersByCRD_NilCRDSet, TestGenerate_SplitLayout (end-to-end three-file layout + needs dissolution), TestGenerate_AllCRD_CollapsesToSingleFile, TestGenerate_NoCRD_KeepsSingleFile, TestRegistry_CRDOwnersStillMarked (sentinel guard against rename/unmark).
  • pkg/bundler/deployer/helmfile/helmfile_test.goassertHelmfileShape refactored into split-aware top-level walker + leaf checker; upstream_helm_only scenario flipped to the split layout with new crds.yaml + releases.yaml goldens.
  • tests/chainsaw/cli/bundle-helmfile/chainsaw-test.yaml — every step that inspected helmfile.yaml's leaf shape now resolves a LEAVES list (one entry in the single-file case, two in the split case) and iterates.

Generated README gains: helm-diff plugin prerequisite (helmfile invokes it under the hood for apply / diff) and a paragraph explaining the helmfiles: layout when HasCRDLayer is true.

Testing

make qualify

All 21 chainsaw tests + go test -race + golangci-lint + yamllint pass. Coverage on pkg/bundler/deployer/helmfile unchanged.

Manual end-to-end via chainsaw test ./tests/chainsaw/cli/bundle-helmfile --no-cluster against an eks-h100-ubuntu-training recipe (which triggers the split because cert-manager + kube-prometheus-stack + nodewright-operator + nvsentinel are all marked): top-level helmfile.yaml carries helmfiles: [crds.yaml, releases.yaml], both sub-files parse, helmfile build resolves the cross-file release graph cleanly.

Risk Assessment

  • Low — Additive on bundles without CRD-marked components (single-file layout preserved byte-for-byte). For bundles that DO have marked components, the previous output was broken on fresh clusters; the new output is the documented helmfile pattern. Easy to revert via git revert.

Rollout notes: No CLI flag, no API surface change. Operators of existing bundles will see the split layout the next time they regenerate; helmfile apply continues to work without flags. The README in the generated bundle explains the new layout.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

helm-diff (and therefore `helmfile apply` / `helmfile diff`) renders
every release against the live cluster's REST mapper before any
release is installed. On a fresh cluster this fails for charts whose
templates reference a CRD installed by a different chart in the same
bundle (e.g., nodewright-customizations uses kind: Skyhook, registered
by nodewright-operator). `helmfile`'s needs: directive sequences
installs but not the diff pre-pass, so the document aborts before any
mutation.

Fix: when any referenced component is registry-marked installsCRDs,
the helmfile deployer emits a multi-helmfiles document:

    helmfile.yaml      # helmfiles: [crds.yaml, releases.yaml]
    crds.yaml          # every CRD-owner release
    releases.yaml      # everything else

helmfile processes the helmfiles: list sequentially, so by the time
releases.yaml renders, every CRD installed by crds.yaml is registered
with the cluster's REST mapper. Single-file layout is preserved when
no marked component is present (no behavior change for non-mixed
bundles), and the path collapses to a single helmfile.yaml in the
pathological "every component is a CRD-owner" case.

Marks cert-manager, nodewright-operator, kube-prometheus-stack,
nvsentinel, agentgateway-crds, and slinky-slurm-operator-crds as
installsCRDs: true. Adds split-aware shape assertions, dedicated
unit tests for the partition + layout decision, golden files for the
mixed case, and a registry-sentinel guard. Generated README documents
the helm-diff plugin prerequisite and the helmfiles: explainer.
Chainsaw test cli-bundle-helmfile gains per-leaf iteration.

Demo cuj1-gke-config.md: drops `--skip-diff-on-install`.

Fixes #914
@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

lockwobr
lockwobr previously approved these changes May 15, 2026

@lockwobr lockwobr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LTGM

@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 76.3%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.3%25-green)

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/bundler/deployer/helmfile 85.79% (-1.97%) 👎
github.com/NVIDIA/aicr/pkg/recipe 90.23% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/bundler/deployer/helmfile/helmfile.go 79.71% (+0.16%) 138 (+50) 110 (+40) 28 (+10) 👍
github.com/NVIDIA/aicr/pkg/bundler/deployer/helmfile/releases.go 100.00% (ø) 59 59 0
github.com/NVIDIA/aicr/pkg/recipe/components.go 86.54% (ø) 104 90 14

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Three follow-ups from the review on the helmfile CRD-split PR:

1. Bound the test Generate calls with context.WithTimeout
   (testGenerateTimeout = 30s). The pkg/bundler I/O path should
   carry a deadline per project rules; context.Background was a
   gap.

2. Differentiate "file exists" from other os.Stat errors in the
   single-file regression guards. The prior `\!os.IsNotExist(err)`
   would have silently failed-open on a permission error. Extracted
   the duplicated check into assertSubHelmfilesAbsent.

3. Critical: fix the subshell exit in the chainsaw
   per-component-values-paths step. `awk | while … exit 1` only
   exits the subshell; the parent script kept returning 0 on
   broken bundles. Switched to a temp-file intermediate (process
   substitution would be cleaner but is bash-only — chainsaw runs
   scripts under /bin/sh).

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/deployer/helmfile/crd_split_test.go`:
- Around line 200-202: The test discards the error from yaml.Unmarshal into
TopHelmfile (variable top) before asserting on top.Helmfiles; change the call to
capture and check the error (e.g. err := yaml.Unmarshal(data, &top)) and fail
the test immediately on error (using t.Fatalf/t.Fatal or the same assertion
helper used for the later Helmfile unmarshal) so you don't proceed with a
zero-value top; reference the TopHelmfile unmarshalling site and mirror the
error handling used for the later Helmfile unmarshal.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: e89f5f98-512d-4e23-8f42-64bdc4ab79ce

📥 Commits

Reviewing files that changed from the base of the PR and between 80b2038 and 4a59d24.

📒 Files selected for processing (2)
  • pkg/bundler/deployer/helmfile/crd_split_test.go
  • tests/chainsaw/cli/bundle-helmfile/chainsaw-test.yaml

Comment thread pkg/bundler/deployer/helmfile/crd_split_test.go
@mchmarny mchmarny enabled auto-merge (squash) May 15, 2026 17:14
@mchmarny mchmarny merged commit 2a865c6 into main May 15, 2026
89 checks passed
@mchmarny mchmarny deleted the fix/914-helmfile-crd-split branch May 15, 2026 17:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: helmfile bundle fails on fresh cluster — CRD/CR render order between nodewright-operator and nodewright-customizations

2 participants