Skip to content

fix(bundler): stratify helmfile bundle by DAG level (#914)#926

Merged
lockwobr merged 1 commit into
mainfrom
fix/crd-cycle-prometheus
May 15, 2026
Merged

fix(bundler): stratify helmfile bundle by DAG level (#914)#926
lockwobr merged 1 commit into
mainfrom
fix/crd-cycle-prometheus

Conversation

@lockwobr

@lockwobr lockwobr commented May 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Replace #922's 2-way crds.yaml / releases.yaml split with an N-way DAG-stratified
layout: one sub-helmfile per topological level (level-0.yamllevel-N.yaml).
Cross-chart CRD ordering is now handled by dependencyRefs alone; the
installsCRDs flag is removed. Self-reference cases (CRD + CR in the same
release) get a per-release disableValidation: true driven by a new
hasSelfRefCRDs registry field.

Motivation / Context

Issue #914 is the helm-diff "CRD not yet registered" race that fires on a fresh
cluster when a chart's templates reference CRDs installed by a sibling (or by
the same chart). #922 fixed the cross-chart case for a hand-marked subset of
charts via a 2-way sub-helmfile split. In live testing on kind we discovered
several gaps:

  • nvsentinel was incorrectly marked installsCRDs: true (no in-bundle consumer
    of HealthCheck / NodeHealth CRs), causing it to land in crds.yaml and race
    against prometheus-operator-crds' diff in the same tier.
  • network-operator's NodeFeatureRule CR depends on nfd's CRD, which wasn't
    marked.
  • gpu-operator, kai-scheduler, and kubeflow-trainer all self-reference CRDs in
    their own crds/ directory from their own templates — a failure mode the
    2-way split can't solve regardless of which tier they land in.
  • kube-prometheus-stack's CRDs ship in a subchart templates/, so its own
    templates render against an unregistered REST mapper even on fresh installs.

The fix is to stop hand-marking charts and let the DAG do the ordering. A
chart that produces CRDs and a chart that consumes them are already related
via dependencyRefs; the bundler now reads that graph directly.

Fixes: #914
Related: #922

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Refactoring (no functional changes — installsCRDs removal, levels infra)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Docs/examples (docs/, examples/)

Implementation Notes

recipe.TopologicalLevels (new in pkg/recipe/metadata.go): Kahn variant
that drains the entire current zero-in-degree generation as one batch before
processing its dependents. Returns [][]string where every component in level
i is mutually independent (no edges among them) and depends only on
components in levels <i. 14 unit-test cases cover cycle detection
(self-loop, two- and three-node), missing-dep surfacing as cycle, diamond,
wide root, the realistic AICR-shape, alphabetical tie-break within a level,
and a cross-check against TopologicalSort that flattened levels respect
every dependencyRefs edge.

Helmfile bundler refactor (pkg/bundler/deployer/helmfile/):

  • splitFoldersByCRDsplitFoldersByLevel. N-way partition by parent
    component's DAG level. Folders with unknown parents default to level 0
    (defensive).
  • writeHelmfileLayout emits one sub-helmfile per non-empty level
    (level-0.yaml, level-1.yaml, ...) and a top-level helmfile.yaml
    carrying helmfiles: [level-0.yaml, ...]. Single-level bundles collapse
    to a single helmfile.yaml (no sub-helmfile overhead).
  • Cross-level needs: edges dissolve because helmfile processes the
    helmfiles: list sequentially. Within a level there's still a needs:
    chain for deterministic install order.

Registry schema:

  • New hasSelfRefCRDs: bool field. Drives per-release disableValidation: true
    in helmfile. Marked components: gpu-operator (ClusterPolicy template +
    CRD), kai-scheduler (default-queue / default-shard), kubeflow-trainer
    (default ClusterTrainingRuntimes).
  • installsCRDs field removed. The DAG-level partition makes it redundant
    for placement, and helmfile v1 doesn't accept args: per release (so we
    can't piggyback SSA on it). Removed from ComponentConfig, every registry
    entry, and TestRegistry_CRDOwnersStillMarked.

kube-prometheus-stack split:

  • New prometheus-operator-crds component (chart 28.0.1, matches operator
    v0.90.1 pinned by kube-prometheus-stack 84.4.0).
  • kube-prometheus-stack ships with crds.enabled: false and declares
    dependencyRefs: [prometheus-operator-crds], so the operator chart lands
    in a later level than its CRDs.

Server-side apply NOT emitted: helmfile v1 removed per-release args:,
and helmDefaults.args passes flags to helm-diff too (which doesn't accept
--server-side). The supported escape hatch is HELM_DIFF_IGNORE_UNKNOWN_FLAGS=true
as an env var — explicitly rejected because the bundle should be self-contained.
If the annotation-too-large failure surfaces in practice, follow-up will mark
specific releases with another registry flag.

Known follow-up (NOT in this PR):

  • TopologicalSort (Kahn with per-dequeue tie-break) and TopologicalLevels
    (Kahn with generation-drain) produce different flat orderings. Folder
    numbering (001-N, driven by TopologicalSort) doesn't strictly match
    install order (driven by levels). Cosmetic — bundle deploys correctly
    either way. Unifying is a separate PR because it touches every chainsaw
    • UAT recipe golden.

Testing

unset GITLAB_TOKEN && make qualify
# 21/21 chainsaw e2e tests pass
# golangci-lint, yamllint, license headers, MDX, chart-pin checks all clean
# Grype vulnerability scan clean
# Coverage: 75.1% (no per-package regression >0.5%)

Live tested against a fresh kind cluster:

  • helmfile apply on an eks/h100/training/kubeflow bundle → fixes nvsentinel
    PodMonitor race, network-operator NodeFeatureRule race
  • gpu-operator ClusterPolicy self-reference resolved via disableValidation: true on the gpu-operator release
  • kai-scheduler default-queue / default-shard self-reference resolved likewise
  • kubeflow-trainer default ClusterTrainingRuntime self-reference resolved
    likewise

Per-package coverage on the affected paths:

  • pkg/recipe: 90.2% (added 14 cases for TopologicalLevels)
  • pkg/bundler/deployer/helmfile: stable; goldens regenerated for the new
    level-N.yaml layout

Risk Assessment

  • Medium — Touches recipe, bundler, registry schema, and every
    generated bundle layout. The DAG-based partition is a strictly more general
    algorithm than the previous flag-based one; existing recipes produce
    correct (and structurally simpler) bundles. The schema change is removing
    a field, not adding one — recipes that referenced installsCRDs: true
    in YAML overlays would be silently ignored on the new schema, so a quick
    grep across consumers before merge is worthwhile.

Rollout notes:

  • No CLI flag, no API surface change.
  • Operators regenerating bundles after merge will see the new
    level-N.yaml layout instead of crds.yaml + releases.yaml. Both layouts
    are valid helmfile.yaml inputs; existing helmfile apply / helmfile diff invocations against the bundle directory work without changes.
  • Any downstream tooling that grepped for crds.yaml / releases.yaml in a
    generated bundle needs to be updated to level-*.yaml. (None known
    in-tree; chainsaw bundle-helmfile test updated.)

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

  Replaces #922's 2-way crds.yaml/releases.yaml split with an N-way
  partition keyed on the recipe dependency graph. The bundler now
  computes topological levels via recipe.TopologicalLevels (Kahn,
  drain-current-generation variant) and emits one sub-helmfile per
  non-empty level: level-0.yaml, level-1.yaml, .... The top-level
  helmfile.yaml carries `helmfiles: [level-0.yaml, ...]`; helmfile
  processes the list sequentially, so by the time level K diffs every
  release in levels 0..K-1 has fully applied and the cluster's REST
  mapper is warm for every CRD they install.

  This fixes the helm-diff CRD-cycle race across the full bundle
  without requiring a registry flag per CRD-installer chart. Cross-
  chart races are now handled by the DAG itself — `dependencyRefs`
  already encodes the relationship.

  Why not just keep installsCRDs:
  The 2-way split needed the flag to decide placement, and it papered
  over the fact that some charts shipping CRDs are also CRD-consumers
  of sibling charts (nvsentinel consumes prometheus-operator-crds).
  The stratified design has no such tier confusion; flag removed.

  What still needs a registry mark:
  hasSelfRefCRDs: true. Some charts contain a CRD AND a CR of that
  kind in the same release (gpu-operator's ClusterPolicy template,
  kai-scheduler's default-queue, kubeflow-trainer's
  ClusterTrainingRuntime runtimes). helm-diff renders both and
  validates the CR before helm has applied the CRD — no inter-release
  ordering can fix that. The new HasSelfRefCRDs registry field emits
  helmfile per-release `disableValidation: true` for these.

  Related changes:
  - Split kube-prometheus-stack: new prometheus-operator-crds
    component (chart 28.0.1, matches operator v0.90.1 pinned by
    kube-prometheus-stack 84.4.0). kube-prometheus-stack now runs
    with crds.enabled: false and depends on prometheus-operator-crds
    via DependencyRefs.
  - nvsentinel installsCRDs removed (nothing else in the bundle
    creates HealthCheck/NodeHealth CRs; its PodMonitor consumer
    relationship is satisfied transitively).
  - New TopologicalLevels function in pkg/recipe/metadata.go with 14
    unit-test cases (cycle detection, missing-dep, diamond, wide
    root, realistic AICR-shape, alphabetical tie-break, cross-check
    against TopologicalSort).
  - splitFoldersByCRD → splitFoldersByLevel (N-way partition).
  - HelmDefaults/Release struct: DisableValidation per-release;
    server-side apply NOT emitted (helmfile v1 removed per-release
    `args:`, and helmDefaults.args leaks into helm-diff which doesn't
    accept --server-side without HELM_DIFF_IGNORE_UNKNOWN_FLAGS).
  - Bundle goldens regenerated (testdata/upstream_helm_only/);
    chainsaw bundle-helmfile + ai-conformance offline test updated
    for the level-N.yaml file pattern; UAT + chainsaw cli recipe
    assertions updated for the new prometheus-operator-crds entry.
  - BOM regenerated; component-catalog, api-reference, design/006
    image-pinning, ai-conformance README updated.

  Known follow-up (NOT in this PR):
  - TopologicalSort and TopologicalLevels produce different flat
    orderings due to Kahn tie-break vs. level-drain. Folder
    numbering (001-N, driven by TopologicalSort) doesn't strictly
    match install order (driven by levels). Cosmetic; bundle deploys
    correctly. Will be unified in a follow-up.

  Fixes: #914
@github-actions

Copy link
Copy Markdown
Contributor

@coderabbitai

coderabbitai Bot commented May 15, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR refactors the helmfile bundle generation to resolve fresh-cluster CRD/CR ordering failures (issue #914) by stratifying helmfile layout across dependency-graph depth levels and introducing a dedicated prometheus-operator-crds component to separate CRD installation from consumer charts. The core changes span the recipe metadata model (topological level grouping), helmfile layout generator (level-based sub-helmfiles instead of CRD-owner partition), release schema (disableValidation flag), component registry updates, and comprehensive test/assertion updates across all demo recipes and validation suites. The solution moves from a fixed two-file layout (crds.yaml + releases.yaml) to dynamically generated level-N.yaml files ordered by dependency depth, allowing clusters to install CRDs before dependent charts attempt to render, while the disableValidation flag skips helm-diff validation for components with self-referencing CRDs that would fail on fresh clusters.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/aicr#899: Introduces foundational helmfile deployer implementation (pkg/bundler/deployer/helmfile), which this PR extends significantly with stratified layout generation and per-release validation flags.
  • NVIDIA/aicr#922: Both PRs refactor helmfile bundle layout to split orchestration into multiple sub-helmfiles; #922 uses CRD-owner partitioning (replaced in this PR), while this PR implements dependency-depth stratification as the preferred solution.

Suggested labels

area/bundler, area/recipes, area/tests, area/docs, size/L, bug

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title 'fix(bundler): stratify helmfile bundle by DAG level (#914)' clearly and specifically describes the main change: replacing the 2-way split with a DAG-level stratification for helmfile bundles to fix issue #914.
Linked Issues check ✅ Passed The pull request successfully addresses issue #914 by implementing the preferred C1 solution: generating a multi-helmfiles document with N-way DAG-stratified levels (level-0.yaml, level-1.yaml, etc.) that process sequentially, preventing helm-diff CRD/CR render races without per-chart registry flags.
Out of Scope Changes check ✅ Passed All changes are directly related to the DAG stratification objective: recipe TopologicalLevels, helmfile bundler refactor, schema updates (hasSelfRefCRDs, installsCRDs removal), prometheus-operator-crds split, and test/documentation updates align with issue #914 resolution.
Description check ✅ Passed The PR description clearly relates to the changeset, detailing the replacement of a 2-way crds/releases split with N-way DAG-stratified helmfile layout, removal of installsCRDs flag, and introduction of hasSelfRefCRDs.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/crd-cycle-prometheus

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml`:
- Around line 93-95: Two inline comments that assert the component count still
read "15" even though prometheus-operator-crds was added; update both
occurrences of the "15" component-count comments to "16" so the comment matches
the new expectation (search for the component-count/comment strings that
reference "15" near the list of components including "prometheus-operator-crds"
and replace them with "16").
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 03336542-6503-4fa7-a170-84c235a0c906

📥 Commits

Reviewing files that changed from the base of the PR and between 35a706c and 63c000a.

📒 Files selected for processing (36)
  • demos/cuj2-demo.md
  • docs/design/006-image-pinning-policy.md
  • docs/user/api-reference.md
  • docs/user/component-catalog.md
  • docs/user/container-images.md
  • examples/recipes/aks-training.yaml
  • examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml
  • examples/recipes/eks-training.yaml
  • examples/recipes/kind.yaml
  • pkg/bundler/deployer/helmfile/crd_split_test.go
  • pkg/bundler/deployer/helmfile/helmfile.go
  • pkg/bundler/deployer/helmfile/helmfile_test.go
  • pkg/bundler/deployer/helmfile/releases.go
  • pkg/bundler/deployer/helmfile/testdata/kai_scheduler_async/helmfile.yaml
  • pkg/bundler/deployer/helmfile/testdata/mixed_gpu_operator/helmfile.yaml
  • pkg/bundler/deployer/helmfile/testdata/upstream_helm_only/helmfile.yaml
  • pkg/bundler/deployer/helmfile/testdata/upstream_helm_only/level-0.yaml
  • pkg/bundler/deployer/helmfile/testdata/upstream_helm_only/level-1.yaml
  • pkg/recipe/components.go
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_test.go
  • recipes/components/kube-prometheus-stack/values.yaml
  • recipes/components/prometheus-operator-crds/values.yaml
  • recipes/overlays/base.yaml
  • recipes/registry.yaml
  • tests/chainsaw/ai-conformance/README.md
  • tests/chainsaw/ai-conformance/offline/assert-recipe.yaml
  • tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml
  • tests/chainsaw/cli/bundle-helmfile/chainsaw-test.yaml
  • tests/chainsaw/cli/cuj1-training/assert-recipe.yaml
  • tests/uat/aws/tests/cuj1-training/assert-recipe.yaml
  • tests/uat/aws/tests/cuj2-inference/assert-recipe.yaml
  • tests/uat/azure/tests/cuj1-training/assert-recipe.yaml
  • tests/uat/azure/tests/cuj2-inference/assert-recipe.yaml
  • tests/uat/gcp/tests/cuj1-training/assert-recipe.yaml
  • tests/uat/gcp/tests/cuj2-inference/assert-recipe.yaml

Comment on lines +93 to 95
agentgateway agentgateway-crds \
prometheus-operator-crds kube-prometheus-stack \
nvidia-dra-driver-gpu nvsentinel prometheus-adapter \

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update stale component-count comments to match the new expectation.

After adding prometheus-operator-crds, the file now validates 16 components/directories, but Line 24 and Line 84 still say "15". Please update both comments to avoid confusion during test maintenance.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/chainsaw/ai-conformance/offline/chainsaw-test.yaml` around lines 93 -
95, Two inline comments that assert the component count still read "15" even
though prometheus-operator-crds was added; update both occurrences of the "15"
component-count comments to "16" so the comment matches the new expectation
(search for the component-count/comment strings that reference "15" near the
list of components including "prometheus-operator-crds" and replace them with
"16").

@github-actions

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 76.4%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.4%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/bundler/deployer/helmfile 86.88% (+1.09%) 👍
github.com/NVIDIA/aicr/pkg/recipe 90.41% (+0.18%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/bundler/deployer/helmfile/helmfile.go 81.88% (+2.16%) 160 (+22) 131 (+21) 29 (+1) 👍
github.com/NVIDIA/aicr/pkg/bundler/deployer/helmfile/releases.go 100.00% (ø) 61 (+2) 61 (+2) 0
github.com/NVIDIA/aicr/pkg/recipe/components.go 86.54% (ø) 104 90 14
github.com/NVIDIA/aicr/pkg/recipe/metadata.go 96.99% (+0.37%) 266 (+29) 258 (+29) 8 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DAG-stratification thesis is right: removing the hand-marked installsCRDs flag and letting dependencyRefs drive ordering is structurally simpler and handles the cross-chart cases that #922's 2-way split couldn't. TopologicalLevels (with the level-by-level Kahn batching) is well-tested, and the chainsaw test was updated to walk the new layout. Per-release disableValidation: true is the right call for self-reference cases like gpu-operator and kai-scheduler.

Findings:

  • Medium — stale bundle README. The template at pkg/bundler/deployer/helmfile/templates/README.md.tmpl lines 26-31 still references crds.yaml / releases.yaml. Every multi-level bundle ships an inaccurate operator-facing README. The HasCRDLayer field that gates it also needs renaming. Details inline at helmfile.go:168.
  • Medium — brittle transitive chain for monitoring CRDs. nvsentinel relies on a three-hop dep path to prometheus-operator-crds; a future refactor of an intermediate node silently re-introduces the helm-diff race. Worth a direct edge or at least a TODO. Details inline at recipes/registry.yaml.
  • Lowreleases.go:216 cascades disableValidation: true to injected -pre/-post folders of a self-ref-CRDs parent (visible in mixed_gpu_operator golden). Probably intentional but the inline rationale doesn't explain why.
  • Nit — duplicated comment line in releases.go:209-210. Stale crds.yaml/releases.yaml text also lingers in helmfile.go:363-365 and helmfile_test.go:753,790 (those lines are outside the diff hunks, mentioning here for the sweep).

Nothing blocks merge; the README fix is worth catching before this lands since it ships into every bundle. CI Tier 2 jobs were still running at review time.

// Emit either the single-file helmfile.yaml or the stratified
// helmfiles: layout depending on whether the dependency DAG
// produces more than one level (issue #914).
splitLayout, err := g.writeHelmfileLayout(outputDir, output, writeResult.Folders, sortedRefs, namespaceByComponent)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — stale user-facing README. The splitLayout bool flows into readmeData.HasCRDLayer and gates a block in templates/README.md.tmpl (lines 26-31) that still tells operators to look for crds.yaml and releases.yaml:

For bundles that include CRD-owning charts,
`helmfile.yaml` is a thin orchestration document with a `helmfiles:` list:
`crds.yaml` is processed first to register CRDs, then `releases.yaml`
installs every chart that consumes them.

Those files no longer exist in this layout — the sub-helmfiles are now level-0.yamllevel-N.yaml. As-is, anyone reading the bundle's own README will be looking for files the bundler doesn't emit. The first sentence ("For bundles that include CRD-owning charts…") is also no longer accurate — any recipe whose DAG produces more than one level will render this block.

Action items:

  • Rewrite the template block to describe the stratified layout (one sub-helmfile per dependency level, processed in order).
  • Rename readmeData.HasCRDLayer (and the related splitLayout local) to something like HasSubHelmfiles / stratified — the current name claims something the field no longer means.
  • Regenerate the golden at pkg/bundler/deployer/helmfile/testdata/upstream_helm_only/README.md (lines 25-30 carry the same stale text) via go test -update ./pkg/bundler/deployer/helmfile/....

Comment on lines +209 to +210
// Registry-driven per-release flags. HasSelfRefCRDs emits
// Registry-driven per-release flags. HasSelfRefCRDs emits

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this comment line is duplicated — // Registry-driven per-release flags. HasSelfRefCRDs emits appears twice. Drop one.

// they ship in `crds/` (e.g., gpu-operator's clusterpolicy
// template). Keyed by f.Parent so injected -pre / -post
// folders inherit the same flag.
if flags[f.Parent].HasSelfRefCRDs {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inheriting disableValidation: true to every f.Parent == gpu-operator folder (primary + injected -pre/-post) is over-broad and shows up in goldens like testdata/mixed_gpu_operator/helmfile.yaml where gpu-operator-post (just a RuntimeClass manifest) ends up with disableValidation: true. The post-folder is a local-helm wrapper around arbitrary manifests — it doesn't ship the chart's crds/ and doesn't self-reference any CRD, so there's no reason to disable validation on it.

The trade-off is small (only the post-folder loses helm-diff's mapper sanity check), so this is fine to defer if it simplifies the code, but worth a comment acknowledging the choice. Today the inline comment says "so injected -pre / -post folders inherit the same flag" without explaining why we want that inheritance — it reads as accidental coupling rather than intentional.

Comment thread recipes/registry.yaml
Comment on lines +258 to +264
# nvsentinel's templates create monitoring.coreos.com/v1 PodMonitor
# CRs that prometheus-operator-crds must register first. Ordering
# is satisfied transitively in the DAG-stratified helmfile layout:
# nvsentinel depends on gpu-operator, which depends on
# kube-prometheus-stack, which depends on prometheus-operator-crds,
# so by the time nvsentinel's level diffs, the monitoring CRDs are
# already registered. Issue #914.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — brittle transitive chain. This ordering relies on a three-hop transitive path: nvsentinel → gpu-operator → kube-prometheus-stack → prometheus-operator-crds. If anyone removes the gpu-operator → kube-prometheus-stack edge in recipes/overlays/base.yaml (it's plausible — gpu-operator doesn't strictly need kube-prometheus-stack to install, only to scrape metrics), nvsentinel's helm-diff race against PodMonitor CRD silently returns and there's no test that catches it.

The whole PR's thesis is "let dependencyRefs encode the ordering" — applying that consistently means nvsentinel should declare a direct dependencyRefs: [prometheus-operator-crds] (in addition to its existing edges). Same applies to any other component whose templates create monitoring.coreos.com CRs. The DAG-flattening makes the extra edge a no-op when the transitive path already exists, but it survives refactors of the intermediate nodes.

Worth doing in this PR while the rationale is fresh; otherwise drop a "TODO: harden direct edges for monitoring-CR creators" so the gap is documented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: helmfile bundle fails on fresh cluster — CRD/CR render order between nodewright-operator and nodewright-customizations

2 participants