feat(bundler): support dynamic toleration injection for pre-built OCI bundles#1471
Conversation
- Add NO_COLOR-aware color helpers (_ok, _fail, _warn_line) - Add _step_header/ok/fail/retry for per-component section framing - Show [N/M] progress counter and manual copy-paste command per step - Color-code pre-flight pass/fail and final deployment summary Closes NVIDIA#1362
Signed-off-by: Mohit Yadav <[email protected]>
Signed-off-by: Mohit Yadav <[email protected]>
- Re-indent attempt/while/done block inside for-loop body - Add env var caveat to Manual: step hint with printf -v quoting - Regenerate all 6 testdata goldens
When a toleration path is declared via --dynamic, skip baking in the CLI-supplied value during applyNodeSchedulingOverrides. Merging dynamic paths into policy.optOut causes the injection to be skipped, so cluster-values.yaml carries an empty placeholder instead of a baked-in toleration. Operators can then inject additional taints at install time without rebuilding the OCI bundle. Add dynamicPathSetFor() helper that resolves --dynamic override keys to component names via the registry and returns the declared path set. Closes NVIDIA#1371
|
Welcome to AICR, @mohityadav8! Thanks for your first pull request. Before review, please ensure:
A maintainer will review this soon. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe bundler now filters enabled components inline, rejects bundles when none remain enabled, and removes the old filtering helper. It also resolves dynamic value paths from the component registry, adds them to scheduling opt-out, and keeps those paths unset during node scheduling overrides. Agentgateway handling now emits warnings instead of rewriting source ranges. Tests cover alias-based dynamic path resolution and toleration opt-out behavior. Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/bundler_test.go`:
- Around line 2477-2544: The new tests in TestDynamicPathSetFor and
TestDynamicTolerationPathExcludedFromBakeIn use spaces instead of the file’s
tab-based Go formatting, which will fail fmt-check. Reformat the added test
blocks with gofmt (or make tidy) so the indentation matches the surrounding
code, and keep the test logic in place around dynamicPathSetFor,
computeSchedulingPathPolicy, and applyNodeSchedulingOverrides.
In `@pkg/bundler/bundler.go`:
- Around line 1143-1146: The registry-load failure in
applyNodeSchedulingOverrides is being swallowed when
GetComponentRegistryFor(provider) returns an error, which can silently disable
dynamic opt-out behavior. Update the applyNodeSchedulingOverrides path in
bundler.go to keep returning nil on error but also add a slog.Debug log,
matching the existing GetComponentRegistryFor failure handling, so the failure
is visible without changing behavior.
- Around line 1109-1132: Remove the duplicated storage-class injection block in
bundler.go within the component handling logic: the second copy of the
sc/comp.GetStorageClassPaths()/component.ApplyMapOverrides sequence is redundant
and should be deleted, leaving only the original implementation. While editing,
ensure the remaining code in that area uses the existing tab-based Go formatting
so the bundler logic around getValueOverridesForComponent and ApplyMapOverrides
still passes fmt checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 229b45ed-3222-4948-adcd-3c26498a11b7
📒 Files selected for processing (9)
pkg/bundler/bundler.gopkg/bundler/bundler_test.gopkg/bundler/deployer/helm/templates/deploy.sh.tmplpkg/bundler/deployer/helm/testdata/kai_scheduler_present/deploy.shpkg/bundler/deployer/helm/testdata/manifest_only/deploy.shpkg/bundler/deployer/helm/testdata/mixed_gpu_operator/deploy.shpkg/bundler/deployer/helm/testdata/mixed_with_pre/deploy.shpkg/bundler/deployer/helm/testdata/nodewright_present/deploy.shpkg/bundler/deployer/helm/testdata/upstream_helm_only/deploy.sh
There was a problem hiding this comment.
Thanks for this — the core approach is clean: merging dynamic paths into policy.optOut and letting the existing filterPaths machinery skip injection is exactly the right seam, no churn to applyNodeSchedulingOverrides, and optOut is always initialized so the merge is panic-safe.
Blockers before this can merge:
gofmt failure (inline) pkg/bundler/bundler_test.go is space-indented; gofmt -l flags it, so make qualify fails.
One medium concern inline on dynamicPathSetFor (silent fail-open on bad --dynamic keys), plus two nits:
- "empty placeholder" (PR description + the comment at the
optOutmerge): the path ends up absent from values, not set to an empty placeholder — the test correctly asserts absence, but the wording overstates it. - Worth a one-line note in
docs/user/bundling.mdthat--dynamicpaths are excluded from scheduling bake-in, since that's the user-visible behavior this enables.
None of the substance is in question — once the rebase shrinks the diff and gofmt passes, this should be quick to re-review.
- Add slog.Warn when GetByOverrideKey returns nil in dynamicPathSetFor so gaps in upstream validation surface immediately rather than silently baking in a toleration the user expected to leave dynamic - Fix gofmt tab indentation in new test functions
|
@mchmarny |
|
|
- Delete duplicate space-indented storage-class block causing gofmt failure - Fix optOut merge comment: path is absent from values, not an empty placeholder - Add note to docs/user/bundling.md on --dynamic scheduling exclusion
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/bundler.go`:
- Around line 228-252: The new filtering loops in bundler.go ignore cancellation
while iterating over recipeResult.ComponentRefs and DeploymentOrder, so add
ctx.Err() checks inside these loops and return early when the context is
canceled, consistent with extractComponentValues. Update the filtering logic in
the bundle-building path to stop processing once Make’s ctx is done, using the
existing ctx passed through the bundling flow and the same cancellation pattern
used elsewhere in the bundle work.
- Around line 238-244: The re-enable path in bundler logic still carries the
recipe-level enabled override into value generation, so a component restored by
`--set <component>:enabled=true` can still render with stale disabled state.
Update the enabled-ref handling in `bundler.go` where refs are added to
`enabledRefs`/`enabledSet` so that `enabled` is stripped from any ref that was
re-enabled before later value extraction, matching the existing cleanup done for
scalar `--set` values.
- Around line 1265-1270: The source-range check in the bundler helper that
iterates over items and calls netutil.IsAnySourceCIDR is treating malformed
CIDRs as scoped, so invalid values like not-a-cidr bypass the warning. Update
that logic to validate each entry first using net.ParseCIDR (or equivalent)
before calling IsAnySourceCIDR, and return false on parse failure; add the net
import if it is not already used. Keep the behavior in the same helper so the
warning is only suppressed for valid non-any source CIDRs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 29502f5a-8943-4072-9963-1667ef3cf757
📒 Files selected for processing (1)
pkg/bundler/bundler.go
|
Make sure that |
e5aac55 to
5ec97c6
Compare
Summary
Closes #1371
When AICR generates an OCI bundle, GPU node tolerations are baked in at
build time via
--accelerated-node-toleration. Once built, there was noway to inject additional tolerations at install time without rebuilding
the entire bundle — a blocker when customers add arbitrary taints to GPU
nodes (e.g.
reserved-by:NoScheduleduring application upgrades).What changed
pkg/bundler/bundler.goAdded
dynamicPathSetFor(componentName, provider)— resolves--dynamicoverride keys to component names via the registry andreturns the declared path set for that component.
In
extractComponentValues(), dynamic paths are merged intopolicy.optOutbeforeapplyNodeSchedulingOverridesruns. Theexisting
filterPaths()machinery already skips injection for opt-outpaths — so no changes to
applyNodeSchedulingOverridesitself wereneeded. The result:
cluster-values.yamlgets an empty placeholderinstead of a baked-in toleration value.
pkg/bundler/bundler_test.goTestDynamicPathSetFor— verifies the helper correctly resolvesoverride keys to component names and scopes paths per component.
TestDynamicTolerationPathExcludedFromBakeIn— verifies that when--dynamic gpuoperator:daemonsets.tolerationsis set alongside--accelerated-node-toleration, the toleration is NOT written intocomponent values.
How to use
daemonsets.tolerationswill be an empty placeholder incluster-values.yaml. At install time, inject the taint:Testing