feat(flux): add bundle flux option #817

Merged
mchmarny merged 6 commits into NVIDIA:main from haarchri:feature/bundle-flux
May 13, 2026

Conversation

@haarchri

@haarchri haarchri commented May 8, 2026

Contributor

Summary

Add a flux deployer type to the bundler, generating Flux CD HelmRelease CRs, source CRs, and a root kustomization.yaml for GitOps deployment.

Motivation / Context

Organizations using Flux as their GitOps controller had no native bundle output; they had to manually translate Helm bundles into Flux CRs.
This adds first-class Flux support alongside the existing helm, argocd, and argocd-helm deployers.

Fixes: #617

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

I'm a first-time contributor, so guidance on whether the direction is correct would be appreciated. Happy to adjust if something is missing or needs a different approach.

All components produce HelmRelease CRs. Components with Helm charts reference HelmRepository sources; manifest-only and mixed components are packaged as local Helm charts (Chart.yaml + templates/) referencing a GitRepository source.
This avoids the problem of raw Helm template syntax ({{ .Values }}, {{ .Release }}) appearing in plain YAML: Flux's Helm controller renders the templates natively.
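
The source-selection rule described above can be sketched in Go as follows. This is only an illustration under stated assumptions: the `Component` and `ComponentType` types, the component names other than gpu-operator, and the `sourceKindFor` helper are hypothetical and are not taken from the PR's code.

```go
package main

import "fmt"

// ComponentType is a hypothetical classification of a bundle component.
type ComponentType int

const (
	HelmChart    ComponentType = iota // backed by an upstream Helm chart
	ManifestOnly                      // raw manifests only, no upstream chart
	Mixed                             // upstream chart plus extra manifests
)

// Component is an illustrative stand-in, not the real AICR type.
type Component struct {
	Name string
	Type ComponentType
}

// sourceKindFor returns the Flux source kind that a component's HelmRelease
// would reference under the rule in the implementation notes: chart-backed
// components use a HelmRepository, while manifest-only and mixed components
// are packaged as local charts and use a GitRepository.
func sourceKindFor(c Component) string {
	if c.Type == HelmChart {
		return "HelmRepository"
	}
	return "GitRepository"
}

func main() {
	comps := []Component{
		{Name: "gpu-operator", Type: HelmChart},
		{Name: "example-manifests", Type: ManifestOnly}, // hypothetical name
		{Name: "example-mixed", Type: Mixed},            // hypothetical name
	}
	for _, c := range comps {
		fmt.Printf("%s -> HelmRelease sourced from %s\n", c.Name, sourceKindFor(c))
	}
}
```

Either way every component ends up as a HelmRelease, so templating is always handled by Flux's Helm controller and only the source reference differs.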

Testing

unset GITLAB_TOKEN && make build
rm -rf /tmp/flux-bundle && /tmp/aicr-test bundle \
  -r /Users/haarchri/Documents/aicr/recipes/overlays/h100-eks-ubuntu-training.yaml \
  -o /tmp/flux-bundle \
  --deployer flux \
  --repo https://github.com/haarchri/aicr-flux-bundle.git
cat <<'EOF' | kubectl apply -f -
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: aicr-stack
  namespace: flux-system
spec:
  url: https://github.com/haarchri/aicr-flux-bundle.git
  ref:
    branch: main
  interval: 10m
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: aicr-stack
  namespace: flux-system
spec:
  sourceRef:
    kind: GitRepository
    name: aicr-stack
  path: ./
  prune: true
  interval: 10m
EOF

Live test bundle: https://github.com/haarchri/aicr-flux-bundle (generated from the h100-eks-ubuntu-training overlay).

# Unit tests (18 pass, race detector clean)
go test -v -race ./pkg/bundler/deployer/flux/...

# Lint (0 issues)
golangci-lint run -c .golangci.yaml ./pkg/bundler/deployer/flux/...

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: New deployer type only activated when --deployer flux is explicitly specified. No impact on existing helm, argocd, or argocd-helm deployers. No migration needed.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@copy-pr-bot

copy-pr-bot Bot commented May 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented May 8, 2026

Contributor

Welcome to AICR, @haarchri! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@haarchri

haarchri commented May 8, 2026

Contributor Author

I will test with a few more recipe overlays and check whether the approach for inline Helm charts (e.g. nodewright-customizations, gpu-operator-post) works as implemented, and whether Dynamic needs to work for these.

one of the examples: https://github.com/haarchri/aicr-flux-bundle/blob/main/nodewright-customizations/templates/tuning.yaml

@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@github-actions

github-actions Bot commented May 8, 2026

Contributor

@mchmarny

mchmarny commented May 8, 2026

Member

This will close a significant gap in AICR, thanks @haarchri

@haarchri

haarchri commented May 8, 2026

Contributor Author

@mchmarny is an upfront review possible? It would make it easier to finish the PR after more testing.

@haarchri haarchri force-pushed the feature/bundle-flux branch 2 times, most recently from 68504d7 to 57ace34 Compare May 11, 2026 13:58

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/user/cli-reference.md`:
- Line 1372: Change the ordered-list prefix on the line starting with "2.
**Accelerated Selector Warning**:" to use "1." to comply with the configured
ordered-list style (1/1/1); update the list item that mentions
`--accelerated-node-selector` so its prefix is "1." instead of "2." ensuring
MD029 passes.

In `@pkg/bundler/deployer/flux/doc.go`:
- Line 16: Update the package comment in package flux to use the AICR
terminology instead of “Cloud Native Stack recipes”; modify the doc string at
the top of pkg/bundler/deployer/flux/doc.go so it reads something like “Package
flux provides Flux manifest generation for AICR (Application-Integration
Configuration Reconciler) behavior” (or similar concise phrasing referencing
AICR) so the package doc aligns with AICR terminology and avoids contributor
confusion.

In `@pkg/bundler/deployer/flux/flux_test.go`:
- Around line 101-105: In the loop that checks expectedFiles (using
filepath.Join to build fullPath and calling os.Stat), change the error handling
so any non-nil statErr causes the test to fail: if os.Stat returns an error,
call t.Fatalf or t.Errorf including the statErr in the message rather than only
checking os.IsNotExist; preserve the existing message for the not-exist case but
include the error details for other filesystem errors so the test cannot
silently pass on permission or I/O errors.
- Around line 35-37: Replace unbounded contexts in the TestGenerate_* tests by
creating a timeout-bounded context (e.g., ctx, cancel :=
context.WithTimeout(context.Background(), 10*time.Second)) and defer cancel()
before calling Generate; update TestGenerate_Success and all other
TestGenerate_* cases to pass this ctx into the Generate call so file I/O ops
cannot hang indefinitely.
- Around line 307-310: The test TestGenerate_ContextCancellation currently only
checks that Generate(ctx, t.TempDir()) returns some error; change the assertion
to verify it's a context cancellation by using errors.Is(err, context.Canceled).
Specifically, after calling g.Generate(ctx, t.TempDir()) update the check from
"if err == nil { t.Fatal(...)" to first assert err is non-nil then use if
!errors.Is(err, context.Canceled) { t.Fatalf("expected context.Canceled, got
%v", err) } so the test fails for unrelated errors.

In `@pkg/bundler/deployer/flux/flux.go`:
- Around line 204-211: The loop over sortedRefs that calls
generateComponentResources should check the context for cancellation to avoid
continuing heavy I/O after ctx is done; before each iteration (inside the for i,
ref := range sortedRefs loop) add a non-blocking check for ctx.Done() and return
early (e.g., return ctx.Err()) if canceled, ensuring generateComponentResources
(and any subsequent append to resources) isn't executed when the parent context
is canceled; reference the loop variables sortedRefs,
generateComponentResources, outputDir, helmSources, gitSources, and output when
locating the insertion point.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 7663d94b-b826-4125-aa89-ccae7c83e770

📥 Commits

Reviewing files that changed from the base of the PR and between 7f86f42 and 57ace34.

📒 Files selected for processing (25)
  • api/aicr/v1/server.yaml
  • docs/README.md
  • docs/contributor/component.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/bundler/bundler.go
  • pkg/bundler/config/config.go
  • pkg/bundler/config/config_test.go
  • pkg/bundler/deployer/deployer.go
  • pkg/bundler/deployer/flux/doc.go
  • pkg/bundler/deployer/flux/flux.go
  • pkg/bundler/deployer/flux/flux_test.go
  • pkg/bundler/deployer/flux/helm.go
  • pkg/bundler/deployer/flux/sources.go
  • pkg/bundler/deployer/flux/templates/README.md.tmpl
  • pkg/bundler/deployer/flux/templates/chart.yaml.tmpl
  • pkg/bundler/deployer/flux/templates/gitrepo-source.yaml.tmpl
  • pkg/bundler/deployer/flux/templates/helmrelease.yaml.tmpl
  • pkg/bundler/deployer/flux/templates/helmrepo-source.yaml.tmpl
  • pkg/bundler/deployer/flux/templates/kustomization.yaml.tmpl
  • pkg/bundler/deployer/flux/testdata/helm_components/cert-manager/helmrelease.yaml
  • pkg/bundler/deployer/flux/testdata/helm_components/gpu-operator/helmrelease.yaml
  • pkg/bundler/deployer/flux/testdata/helm_components/kustomization.yaml
  • pkg/cli/bundle.go
  • tests/chainsaw/cli/bundle-flux/chainsaw-test.yaml

Comment thread docs/user/cli-reference.md
Comment thread pkg/bundler/deployer/flux/doc.go Outdated
Comment thread pkg/bundler/deployer/flux/flux_test.go
Comment thread pkg/bundler/deployer/flux/flux_test.go
Comment thread pkg/bundler/deployer/flux/flux_test.go
Comment thread pkg/bundler/deployer/flux/flux.go

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
pkg/bundler/deployer/flux/helm.go (1)

125-133: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve relative manifest paths to prevent silent template overwrites.

Line 125 uses filepath.Base(name), so a/deploy.yaml and b/deploy.yaml both become templates/deploy.yaml; the later write overwrites the former and drops resources.

Suggested fix
-		safeName := filepath.Base(name)
-		filePath, joinErr := deployer.SafeJoin(templatesDir, safeName)
+		cleanName := filepath.Clean(name)
+		filePath, joinErr := deployer.SafeJoin(templatesDir, cleanName)
 		if joinErr != nil {
 			return joinErr
 		}
+		if err := os.MkdirAll(filepath.Dir(filePath), 0750); err != nil {
+			return errors.Wrap(errors.ErrCodeInternal,
+				fmt.Sprintf("failed to create template subdirectory for %s", compName), err)
+		}
 		if err := os.WriteFile(filePath, content, 0600); err != nil {
 			return errors.Wrap(errors.ErrCodeInternal,
-				fmt.Sprintf("failed to write template %s for %s", safeName, compName), err)
+				fmt.Sprintf("failed to write template %s for %s", cleanName, compName), err)
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/bundler/deployer/flux/helm.go` around lines 125 - 133, The code currently
uses filepath.Base(name) to derive safeName which collapses relative paths
(e.g., a/deploy.yaml and b/deploy.yaml) and causes silent overwrites when
writing via deployer.SafeJoin(templatesDir, safeName); instead preserve the
relative path components (sanitized) when constructing the destination path so
each manifest remains unique: replace the filepath.Base(name) usage that sets
safeName with logic that cleans and joins the original relative path (while
preventing path traversal) before calling deployer.SafeJoin, then continue to
use os.WriteFile(filePath, content, 0600) and include compName in the error wrap
as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@pkg/bundler/deployer/flux/helm.go`:
- Around line 125-133: The code currently uses filepath.Base(name) to derive
safeName which collapses relative paths (e.g., a/deploy.yaml and b/deploy.yaml)
and causes silent overwrites when writing via deployer.SafeJoin(templatesDir,
safeName); instead preserve the relative path components (sanitized) when
constructing the destination path so each manifest remains unique: replace the
filepath.Base(name) usage that sets safeName with logic that cleans and joins
the original relative path (while preventing path traversal) before calling
deployer.SafeJoin, then continue to use os.WriteFile(filePath, content, 0600)
and include compName in the error wrap as before.
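
The collision described in this finding is easy to reproduce in isolation. The sketch below is illustrative only: `destFlattened` mimics the `filepath.Base` behavior, and `destPreserved` is a hypothetical simplification standing in for the repository's `deployer.SafeJoin`, whose actual implementation is not shown in this conversation.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// destFlattened mirrors the pre-fix behavior: only the base name is kept,
// so files from different subdirectories can map to the same destination.
func destFlattened(templatesDir, name string) string {
	return filepath.Join(templatesDir, filepath.Base(name))
}

// destPreserved keeps the cleaned relative path and rejects anything that
// would escape templatesDir. It is a hypothetical stand-in for a safe-join
// helper, not the project's implementation.
func destPreserved(templatesDir, name string) (string, error) {
	clean := filepath.Clean(name)
	sep := string(filepath.Separator)
	if filepath.IsAbs(clean) || clean == ".." || strings.HasPrefix(clean, ".."+sep) {
		return "", fmt.Errorf("path %q escapes %q", name, templatesDir)
	}
	return filepath.Join(templatesDir, clean), nil
}

func main() {
	for _, name := range []string{"a/deploy.yaml", "b/deploy.yaml"} {
		preserved, err := destPreserved("templates", name)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: flattened=%s preserved=%s\n",
			name, destFlattened("templates", name), preserved)
	}
}
```

With the flattened variant both inputs resolve to the same file, so the second write silently replaces the first. Preserving the relative path also means the parent directory has to be created before writing, which is what the `os.MkdirAll` call in the suggested diff does.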

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5acce34c-90eb-412d-8a88-98bf43ac4b3d

📥 Commits

Reviewing files that changed from the base of the PR and between 57ace34 and 2522032.

📒 Files selected for processing (1)
  • pkg/bundler/deployer/flux/helm.go

@haarchri

Contributor Author

I have the two special cases working; using compName did the trick, and the inline Helm charts now work:

source ~/.gvm/scripts/gvm && gvm use go1.26.2
export AICR_BIN="/Users/haarchri/Documents/aicr/dist/aicr_darwin_arm64_v8.0/aicr"
${AICR_BIN} recipe --service eks --accelerator h100 --os ubuntu \
  --intent training -o /tmp/flux-bundle/recipe.yaml
${AICR_BIN} bundle -r /tmp/flux-bundle/recipe.yaml -o /tmp/flux-bundle \
  --deployer flux --repo https://github.com/haarchri/aicr-flux-bundle

It's available here: https://github.com/haarchri/aicr-flux-bundle

kubectl get helmreleases -A
[...]
flux-system   nodewright-customizations       1m    True    Helm upgrade succeeded for release skyhook/skyhook-nodewright-customizations.v2 with chart nodewright-customizations@0.1.0
[...]
flux-system   gpu-operator-post               1m    True    Helm install succeeded for release gpu-operator/gpu-operator-gpu-operator-post.v1 with chart gpu-operator-post@0.1.0

https://github.com/haarchri/aicr-flux-bundle/blob/main/gpu-operator-post/templates/dcgm-exporter.yaml

kubectl get cm -n gpu-operator         dcgm-exporter     -o yaml
apiVersion: v1
data:
  dcgm-metrics.csv: |-
    # Clocks,,
    DCGM_FI_DEV_SM_CLOCK,     gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

    # Temperature,,
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

    # Power,,
    DCGM_FI_DEV_POWER_USAGE,  gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

    # PCIe,,
    DCGM_FI_PROF_PCIE_TX_BYTES,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    DCGM_FI_PROF_PCIE_RX_BYTES,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

    # Utilization (the sample period varies depending on the product),,
    DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL,      gauge, Decoder utilization (in %).

    # Errors and violations,,
    DCGM_FI_DEV_XID_ERRORS,            gauge, Value of the last XID error encountered.
    DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
    DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
    DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
    DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
    DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

    # Memory usage,,
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

    # Retired pages,,
    DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
    DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
    DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

    # NVLink,,
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes
    DCGM_FI_PROF_NVLINK_TX_BYTES,       counter, The rate of data transmitted over NVLink not including protocol headers in bytes per second.
    DCGM_FI_PROF_NVLINK_RX_BYTES,       counter, The rate of data received over NVLink not including protocol headers in bytes per second.

    # Add DCP metrics,,
    DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
    DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
    DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
    DCGM_FI_PROF_PCIE_TX_BYTES,      counter, The number of bytes of active pcie tx data including both header and payload.
    DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.

    # BCP Additional metrics
    DCGM_FI_DEV_GPU_NVLINK_ERRORS,      gauge, Identifies a GPU NVLink error type returned by DCGM_FI_DEV_GPU_NVLINK_ERRORS.

    # Added RunAI Additional metrics https://docs.run.ai/latest/developer/metrics/metrics-api/#advanced-metrics
    ## NVLink
    DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
    ## VGPU License status
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
    ## Remapped rows
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
    ## Static configuration information. These appear as labels on the other metrics
    DCGM_FI_DRIVER_VERSION, label, Driver Version
    ## Profiling metrics
    DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
    DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
    DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: gpu-operator-gpu-operator-post
    meta.helm.sh/release-namespace: gpu-operator
  creationTimestamp: "2026-05-11T14:20:39Z"
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.sh/chart: gpu-operator-post-0.1.0
    helm.toolkit.fluxcd.io/name: gpu-operator-post
    helm.toolkit.fluxcd.io/namespace: flux-system
  name: dcgm-exporter
  namespace: gpu-operator
  resourceVersion: "55664"
  uid: e78046f1-63b6-4e27-b78c-cf4b726226be

https://github.com/haarchri/aicr-flux-bundle/blob/main/nodewright-customizations/templates/tuning.yaml

kubectl get skyhooks tuning -o yaml
apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-delete-policy: before-hook-creation
    helm.sh/hook-weight: "10"
    skyhook.nvidia.com/version: v0.15.0
  creationTimestamp: "2026-05-11T14:32:45Z"
  finalizers:
  - skyhook.nvidia.com/skyhook
  generation: 1
  labels:
    app.kubernetes.io/created-by: aicr
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: skyhook-operator
    helm.sh/chart: nodewright-customizations-0.1.0
  name: tuning
  resourceVersion: "59868"
  uid: a5f24918-86ce-4112-b501-b300d693fcac
spec:
  additionalTolerations:
  - operator: Exists
  autoTaintNewNodes: true
  interruptionBudget: {}
  nodeSelectors: {}
  packages:
    nvidia-setup-full:
      configMap:
        accelerator: h100
        service: eks
      dependsOn:
        nvidia-tuned: 0.3.0
      env:
      - name: NVIDIA_SETUP_INSTALL_KERNEL
        value: "false"
      image: ghcr.io/nvidia/nodewright-packages/nvidia-setup
      interrupt:
        type: reboot
      name: nvidia-setup-full
      resources:
        cpuLimit: "4"
        cpuRequest: "2"
        memoryLimit: 8Gi
        memoryRequest: 4Gi
      version: 0.2.2
    nvidia-setup-kernel:
      configMap:
        accelerator: h100
        service: eks
      env:
      - name: NVIDIA_SETUP_INSTALL_KERNEL
        value: "true"
      - name: NVIDIA_SETUP_KERNEL_ALLOW_NEWER
        value: "false"
      image: ghcr.io/nvidia/nodewright-packages/nvidia-setup
      name: nvidia-setup-kernel
      resources:
        cpuLimit: "4"
        cpuRequest: "2"
        memoryLimit: 8Gi
        memoryRequest: 4Gi
      version: 0.2.2
    nvidia-tuned:
      configMap:
        accelerator: h100
        intent: multiNodeTraining
        service: eks
      dependsOn:
        nvidia-setup-kernel: 0.2.2
      env:
      - name: INTERRUPT
        value: "true"
      image: ghcr.io/nvidia/nodewright-packages/nvidia-tuned
      interrupt:
        type: reboot
      name: nvidia-tuned
      version: 0.3.0
  podNonInterruptLabels: {}
  priority: 200
  runtimeRequired: true
  sequencing: node
  serial: false
[...]

@haarchri

Contributor Author

Also validated the flux deployer with --dynamic plus bake-time CLI flags.

Generated a full Flux bundle (haarchri/aicr-flux-bundle#1) using the following command:

aicr bundle \
  -r recipe.yaml \
  --deployer flux \
  --repo https://github.com/haarchri/aicr-flux-bundle \
  --dynamic gpuoperator:driver.version \
  --dynamic gpuoperator:driver.rdma.enabled \
  --accelerated-node-selector nvidia.com/gpu.present=true \
  --workload-selector app.kubernetes.io/name=training-job \
  -o /tmp/flux-bundle

@haarchri haarchri marked this pull request as ready for review May 11, 2026 16:04
@haarchri haarchri requested a review from a team as a code owner May 11, 2026 16:04
@haarchri haarchri force-pushed the feature/bundle-flux branch from 051dc74 to 00f129e Compare May 11, 2026 16:07
coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@mchmarny mchmarny left a comment

Member

Solid first contribution — clean implementation, follows existing deployer patterns (compile-time deployer.Deployer check, pkg/errors codes, SafeJoin everywhere, table-driven tests), and the chainsaw coverage is thorough.

Pragmatic merge view — CodeRabbit has 10 open inline findings. Some of them are actually good ;)

My read of the must-fix-before-merge subset:

  1. Branding sweep — "Generated by Cloud Native Stack" / "Cloud Native Stack recipes" still in 8 files (doc.go + 6 templates + 3 testdata). One commit, global replace.
  2. flux.go:213 ctx check in loop — project rule, one line.
  3. pkg/bundler/deployer/flux/helm.go:192 basename flatten — not a bug with today's data but a real foot-gun; ~5 lines.
  4. deployer.go:23 typo: argocdhelm → argocd-helm.

Nice-to-have, not blocking: test ergonomics (timeout-bounded ctx, errors.Is(err, context.Canceled), os.Stat error checks), writeTemplate using PropagateOrWrap, stronger dependsOn chainsaw assertion. Address as a follow-up if you want this in fast.

Disagreeing with CR on the ComponentTypeKustomize "major" — registry has zero Kustomize components today, so the default-reject is fine. Document the limitation in doc.go and move on.

CI: tests / CLI E2E is failing on argocd/014-prometheus-adapter/values.yaml yamllint indentation — that's in argocd output, not new flux code. Looks unrelated/possibly pre-existing flake (last on-push run on main was green though). Please rerun once you push the cleanup commit; if it persists on the next push without prometheus-adapter changes, file a separate issue and we shouldn't block this PR.

LGTM after the four must-fix items above.

Comment thread pkg/bundler/deployer/flux/templates/helmrelease.yaml.tmpl Outdated
Comment thread pkg/bundler/deployer/flux/flux.go
Comment thread pkg/bundler/deployer/flux/flux.go
Comment thread pkg/bundler/deployer/flux/helm.go Outdated
@mchmarny

Member

@haarchri, this is not directly related to this PR, but if you can, PTAL at #843. It's a little Argo CD-specific right now, but we will need to close this gap for all deployers, so your opinion there regarding Flux would be much appreciated.

@haarchri haarchri force-pushed the feature/bundle-flux branch from 0ad80a4 to a84beec Compare May 12, 2026 18:55
@haarchri

Contributor Author

I wonder what I need to do now that #846 is in ...

@haarchri haarchri requested a review from mchmarny May 12, 2026 18:57
@haarchri

Contributor Author

@mchmarny

@haarchri, this is not directly related to this PR, but if you can, PTAL at #843. It's a little Argo CD-specific right now, but we will need to close this gap for all deployers, so your opinion there regarding Flux would be much appreciated.

will add my points to the original #843

@haarchri

Contributor Author

Regarding #846: I have a PR for it (haarchri/aicr-flux-bundle#4) and will push the code soon.

@haarchri

Contributor Author

@mchmarny I would love to finish this PR before I need to add more and more on top.

@mchmarny

Member

@mchmarny I would love to finish this PR before I need to add more and more on top.

Just saw your latest commits. I think we are there, assuming the tests pass.

@mchmarny mchmarny enabled auto-merge (squash) May 12, 2026 22:27
@mchmarny

Member

@haarchri PR is ready to merge, you will need to amend your commits:

Commits must have verified signatures.

auto-merge was automatically disabled May 13, 2026 07:37

Head branch was pushed to by a user without write access

@haarchri haarchri force-pushed the feature/bundle-flux branch from 5d25748 to 9a88027 Compare May 13, 2026 07:37
@haarchri

Contributor Author

@mchmarny rebased and set correct signing

@mchmarny mchmarny enabled auto-merge (squash) May 13, 2026 10:11
@mchmarny mchmarny merged commit 8d44545 into NVIDIA:main May 13, 2026
28 checks passed

Development

Successfully merging this pull request may close these issues.

feat(bundler): add Flux CD deployer type

2 participants