fix(bundler): recipe-set scheduling paths override CLI defaults (#982)#1082
Merged
Conversation
The kind overlay set per-operand `affinity` blocks on gpu-operator
(`dcgmExporter`, `devicePlugin`, `gfd`, `validator`) to keep operand
DaemonSets off kind control-plane nodes. `ClusterPolicy.spec.<operand>`
does not expose `affinity`, so the chart silently dropped these keys and
the exclusion was never enforced — operand DaemonSets could land on the
NFD-labeled control-plane and hang.
Switch kind.yaml to a mechanism the chart actually consumes:
`daemonsets.tolerations: []` removes the chart-default catch-all that
tolerates `node-role.kubernetes.io/control-plane:NoSchedule`, restoring
the taint as the real exclusion. For this to stick, the bundler now
treats recipe-overlay-set scheduling paths as authoritative — a snapshot
captured before injection runs blocks CLI/config defaults from
overwriting them. Internal writes within `applyNodeSchedulingOverrides`
are NOT in the snapshot, so the documented system → accelerated
overwrite for shared paths (e.g., NFD `worker.tolerations`) still works.
Incidentally fixes bcm.yaml, whose carefully crafted
`controller.tolerations` for BCM masters was silently replaced by the
CLI-default `{Operator: Exists}`.
Also adds `runtime.GOOS \!= "linux"` guards to the os-collector
integration tests that read `/proc/cmdline`, `/proc/sys`,
`/proc/modules`, and `/usr/lib/os-release`. The collectors degrade via
`slog.Warn` on missing files rather than returning errors, so the
existing `errors.Is(err, os.ErrNotExist)` skip branch never fired on
macOS and the tests failed locally.
Fixes #982
This comment was marked as resolved.
This comment was marked as resolved.
Contributor
Coverage Report ✅
Coverage BadgeMerging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
PR review feedback (#1082): the previous snapshot captured every scheduling path populated in the merged values map, which silently included paths set by a component's default valuesFile. Four components (kai-scheduler, kueue, network-operator, aws-efa) ship chart-default tolerations in their values.yaml that collide with registry-declared scheduling paths. Treating those as authoritative turned --system-node-toleration / --accelerated-node-toleration into a no-op for those components — an unintended CLI regression. Compute the authoritative set from ComponentRef.Overrides + --set only. Pass it explicitly into applyNodeSchedulingOverrides instead of re-snapshotting from the merged values map. Adds: - TestApplyNodeSchedulingOverrides_RespectsRecipeSetPaths now binds the tolPath registry assertion (catches silent regression if daemonsets.tolerations is renamed) and adds a regression-guard sub-test proving a values-file default does NOT lock the path against CLI flags. - TestAuthoritativeSchedulingPaths covers nested overlay overrides, --set exact matches, explicit nil leaf (treated as set, matching empty-slice opt-out semantics), and the values-file-only case (not captured). Verified end-to-end: - kind: daemonsets.tolerations: [] preserved - bcm: controller.tolerations BCM-master list preserved - aks with --system-node-toleration: network-operator operator.tolerations correctly overrides the values.yaml default make qualify passes (22/22 chainsaw, 0 lint issues, 0 vulnerabilities).
CodeRabbit review feedback (#1082): two integration tests (TestSysctlCollector_ExcludesNet, TestSysctlCollector_MixedContent) also call collector.Collect(ctx) which reads /proc/sys but were missing the skipIfNotLinux guard. The collector degrades via slog.Warn on missing files rather than returning os.ErrNotExist, so the existing error-skip branch doesn't fire on macOS/Windows and the assertions pass loosely rather than skipping. Apply the same helper used by the other Linux-only integration tests in this package.
This comment was marked as resolved.
This comment was marked as resolved.
CI feedback (#1082): the previous "skip injection when overlay set the path" policy preserved bcm.yaml's `controller.tolerations` but starved the controller of CLI tolerations needed to schedule on KWOK system nodes (`kwok.x-k8s.io/node=fake:NoSchedule`). KWOK Tier 1 bcm/bcm-helm jobs failed with "1 pods could not be scheduled to nodes". Replace the binary "authoritative or not" flag with a per-path policy: - opt-out : overlay sets the path to an empty list/map/nil → skip injection entirely (kind.yaml's `daemonsets.tolerations: []` keeps gpu-operator off the control-plane via the chart default). - append : overlay or --set sets the path to a non-empty value → UNION the overlay value with CLI tolerations (bcm.yaml's BCM-master + CLI's kwok-fake). - replace : default (no overlay, may have values-file value) → existing REPLACE semantic preserved, so the NFD shared-path system → accelerated overwrite for `worker.tolerations` still produces "accelerated wins". New helpers: - component.AppendTolerationsOverrides(values, tols, paths…) - bundler.classifySchedulingPaths(overrides, setOverrides, paths) - bundler.schedulingPathPolicy{optOut, appendMode} Verified end-to-end: - kind: `daemonsets.tolerations: []` preserved - bcm: controller.tolerations = [BCM-master, BCM-control-plane, kwok-fake] (UNION with --system-node-toleration) - eks/cuj1: NFD worker.tolerations REPLACE-overwritten by accelerated CLI value (cuj1-training contract preserved) - kueue-style values-file default: still overwritten by CLI (reviewer concern: --system-node-toleration never silently becomes a no-op) Test coverage: - TestApplyNodeSchedulingOverrides_RespectsRecipeSetPaths: opt-out / append / replace / values-file matrix - TestClassifySchedulingPaths: policy derivation - TestAppendTolerationsOverrides: append helper behavior including non-slice existing value
This comment was marked as resolved.
This comment was marked as resolved.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stop the kind overlay from emitting unsupported
affinityblocks ongpu-operator operands, and make the bundler treat recipe-overlay-set
scheduling paths as authoritative so a deliberate empty list (or any
explicit value) is no longer silently clobbered by CLI defaults.
Motivation / Context
The kind overlay set per-operand
affinityondcgmExporter,devicePlugin,gfd, andvalidatorto keep GPU Operator operandDaemonSets off kind control-plane nodes.
ClusterPolicy.spec.<operand>does not expose
affinity, so the gpu-operator chart silently dropsthose keys; combined with NFD labeling the control-plane (host kernel
exposes the NVIDIA PCI device) the operand DaemonSets land on the
control-plane and hang.
While verifying the obvious fix (
daemonsets.tolerations: []— a keythe chart does consume), it became clear the bundler was overwriting
that empty list with the CLI-default
{Operator: Exists}toleration,so the kind exclusion would never have taken effect anyway. The same
silent overwrite was also discarding bcm.yaml's careful BCM-master
controller.tolerations.Fixes: #982
Related: bcm overlay regression (no separate issue filed)
Type of Change
Component(s) Affected
pkg/recipe) — kind overlaypkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter) — Linux-only test guardsImplementation Notes
kind overlay (
recipes/overlays/kind.yaml)affinityblocks (silently dropped by the chart).daemonsets.tolerations: []to remove the catchall tolerationthat lets DaemonSets tolerate
node-role.kubernetes.io/control-plane:NoSchedule.Bundler precedence (
pkg/bundler/bundler.go)applyNodeSchedulingOverridesnow snapshots scheduling paths populatedbefore injection runs (i.e., by the recipe overlay or
--setmergedinto values) and skips CLI/config-default injection for those. Internal
writes within the function are not in the snapshot, so the
documented system → accelerated overwrite for shared paths (e.g.,
NFD
worker.tolerations) still works:daemonsets.tolerations: [][{Operator: Exists}][](chart default kicks in)controller.tolerations: [...][{Operator: Exists}][{nvidia.com/gpu, …}]✓[{nvidia.com/gpu, …}]✓worker.tolerations(system + accel)Linux-only test guards (
pkg/collector/os/)The os collectors (grub/sysctl/kmod/release) read Linux paths and
degrade via
slog.Warninstead of returning errors, so the existingerrors.Is(err, os.ErrNotExist)skip branches never fired on macOS andnine integration tests failed locally. Added a shared
skipIfNotLinuxhelper and applied it to the affected tests.
Testing
End-to-end bundle verification:
aicr bundle -r kind-recipe.yaml→daemonsets.tolerations: []preservedaicr bundle -r bcm-recipe.yaml→controller.tolerationsBCM-master list preservedaicr bundle -r eks-recipe.yaml --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule→ NFDworker.tolerationscorrectly set to the GPU tolerationUnit test coverage added:
TestApplyNodeSchedulingOverrides_RespectsRecipeSetPaths(kind/bcm/default scenarios) andTestPreExistingSchedulingPaths/TestFilterPaths(helper functions).Risk Assessment
kind.yamlandbcm.yamlcurrently set scheduling paths explicitly, both intentional; all other overlays receive identical behavior to before. Existingcuj1-training-workloadchainsaw test (which depends on shared-path overwrite) passes.Rollout notes: N/A — pure bug fix, no migration. Users whose overlays set scheduling values explicitly will now see those values honored (previously silently discarded).
Checklist
make testwith-race)make lint)git commit -S)