Skip to content

fix(bundler): recipe-set scheduling paths override CLI defaults (#982)#1082

Merged
mchmarny merged 5 commits into
mainfrom
fix/kind-overlay-gpu-operator-affinity
May 28, 2026
Merged

fix(bundler): recipe-set scheduling paths override CLI defaults (#982)#1082
mchmarny merged 5 commits into
mainfrom
fix/kind-overlay-gpu-operator-affinity

Conversation

@mchmarny

Copy link
Copy Markdown
Member

Summary

Stop the kind overlay from emitting unsupported affinity blocks on
gpu-operator operands, and make the bundler treat recipe-overlay-set
scheduling paths as authoritative so a deliberate empty list (or any
explicit value) is no longer silently clobbered by CLI defaults.

Motivation / Context

The kind overlay set per-operand affinity on dcgmExporter,
devicePlugin, gfd, and validator to keep GPU Operator operand
DaemonSets off kind control-plane nodes. ClusterPolicy.spec.<operand>
does not expose affinity, so the gpu-operator chart silently drops
those keys; combined with NFD labeling the control-plane (host kernel
exposes the NVIDIA PCI device) the operand DaemonSets land on the
control-plane and hang.

While verifying the obvious fix (daemonsets.tolerations: [] — a key
the chart does consume), it became clear the bundler was overwriting
that empty list with the CLI-default {Operator: Exists} toleration,
so the kind exclusion would never have taken effect anyway. The same
silent overwrite was also discarding bcm.yaml's careful BCM-master
controller.tolerations.

Fixes: #982
Related: bcm overlay regression (no separate issue filed)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (pkg/recipe) — kind overlay
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter) — Linux-only test guards

Implementation Notes

kind overlay (recipes/overlays/kind.yaml)

  • Drop the four affinity blocks (silently dropped by the chart).
  • Set daemonsets.tolerations: [] to remove the catchall toleration
    that lets DaemonSets tolerate node-role.kubernetes.io/control-plane:NoSchedule.

Bundler precedence (pkg/bundler/bundler.go)

applyNodeSchedulingOverrides now snapshots scheduling paths populated
before injection runs (i.e., by the recipe overlay or --set merged
into values) and skips CLI/config-default injection for those. Internal
writes within the function are not in the snapshot, so the
documented system → accelerated overwrite for shared paths (e.g.,
NFD worker.tolerations) still works:

Scenario Pre-PR (buggy) Post-PR (correct)
kind: overlay sets daemonsets.tolerations: [] [{Operator: Exists}] [] (chart default kicks in)
bcm: overlay sets controller.tolerations: [...] [{Operator: Exists}] BCM-master list preserved
eks: no overlay, CLI passes GPU toleration [{nvidia.com/gpu, …}] [{nvidia.com/gpu, …}]
NFD: shared worker.tolerations (system + accel) accelerated wins ✓ accelerated wins ✓

Linux-only test guards (pkg/collector/os/)

The os collectors (grub/sysctl/kmod/release) read Linux paths and
degrade via slog.Warn instead of returning errors, so the existing
errors.Is(err, os.ErrNotExist) skip branches never fired on macOS and
nine integration tests failed locally. Added a shared skipIfNotLinux
helper and applied it to the affected tests.

Testing

make qualify
# Passed tests 22 / 0 failed / 0 skipped
# 0 issues (lint)
# No vulnerabilities found

End-to-end bundle verification:

  • aicr bundle -r kind-recipe.yamldaemonsets.tolerations: [] preserved
  • aicr bundle -r bcm-recipe.yamlcontroller.tolerations BCM-master list preserved
  • aicr bundle -r eks-recipe.yaml --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule → NFD worker.tolerations correctly set to the GPU toleration

Unit test coverage added: TestApplyNodeSchedulingOverrides_RespectsRecipeSetPaths (kind/bcm/default scenarios) and TestPreExistingSchedulingPaths/TestFilterPaths (helper functions).

Risk Assessment

  • Medium — Behavior change in bundler precedence touches a code path used by every overlay. Audited: only kind.yaml and bcm.yaml currently set scheduling paths explicitly, both intentional; all other overlays receive identical behavior to before. Existing cuj1-training-workload chainsaw test (which depends on shared-path overwrite) passes.

Rollout notes: N/A — pure bug fix, no migration. Users whose overlays set scheduling values explicitly will now see those values honored (previously silently discarded).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed — N/A, no user-visible flag/API change
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

The kind overlay set per-operand `affinity` blocks on gpu-operator
(`dcgmExporter`, `devicePlugin`, `gfd`, `validator`) to keep operand
DaemonSets off kind control-plane nodes. `ClusterPolicy.spec.<operand>`
does not expose `affinity`, so the chart silently dropped these keys and
the exclusion was never enforced — operand DaemonSets could land on the
NFD-labeled control-plane and hang.

Switch kind.yaml to a mechanism the chart actually consumes:
`daemonsets.tolerations: []` removes the chart-default catch-all that
tolerates `node-role.kubernetes.io/control-plane:NoSchedule`, restoring
the taint as the real exclusion. For this to stick, the bundler now
treats recipe-overlay-set scheduling paths as authoritative — a snapshot
captured before injection runs blocks CLI/config defaults from
overwriting them. Internal writes within `applyNodeSchedulingOverrides`
are NOT in the snapshot, so the documented system → accelerated
overwrite for shared paths (e.g., NFD `worker.tolerations`) still works.

Incidentally fixes bcm.yaml, whose carefully crafted
`controller.tolerations` for BCM masters was silently replaced by the
CLI-default `{Operator: Exists}`.

Also adds `runtime.GOOS \!= "linux"` guards to the os-collector
integration tests that read `/proc/cmdline`, `/proc/sys`,
`/proc/modules`, and `/usr/lib/os-release`. The collectors degrade via
`slog.Warn` on missing files rather than returning errors, so the
existing `errors.Is(err, os.ErrNotExist)` skip branch never fired on
macOS and the tests failed locally.

Fixes #982
@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 77.2%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-77.2%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/bundler 72.99% (+1.36%) 👍
github.com/NVIDIA/aicr/pkg/component 71.70% (+0.78%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/bundler/bundler.go 72.07% (+1.81%) 605 (+57) 436 (+51) 169 (+6) 👍
github.com/NVIDIA/aicr/pkg/component/overrides.go 96.83% (-0.45%) 126 (+16) 122 (+15) 4 (+1) 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

mchmarny added 2 commits May 28, 2026 07:21
PR review feedback (#1082): the previous snapshot captured every
scheduling path populated in the merged values map, which silently
included paths set by a component's default valuesFile. Four components
(kai-scheduler, kueue, network-operator, aws-efa) ship chart-default
tolerations in their values.yaml that collide with registry-declared
scheduling paths. Treating those as authoritative turned
--system-node-toleration / --accelerated-node-toleration into a no-op
for those components — an unintended CLI regression.

Compute the authoritative set from ComponentRef.Overrides + --set only.
Pass it explicitly into applyNodeSchedulingOverrides instead of
re-snapshotting from the merged values map.

Adds:
- TestApplyNodeSchedulingOverrides_RespectsRecipeSetPaths now binds the
  tolPath registry assertion (catches silent regression if
  daemonsets.tolerations is renamed) and adds a regression-guard sub-test
  proving a values-file default does NOT lock the path against CLI flags.
- TestAuthoritativeSchedulingPaths covers nested overlay overrides, --set
  exact matches, explicit nil leaf (treated as set, matching empty-slice
  opt-out semantics), and the values-file-only case (not captured).

Verified end-to-end:
- kind: daemonsets.tolerations: [] preserved
- bcm: controller.tolerations BCM-master list preserved
- aks with --system-node-toleration: network-operator operator.tolerations
  correctly overrides the values.yaml default

make qualify passes (22/22 chainsaw, 0 lint issues, 0 vulnerabilities).
CodeRabbit review feedback (#1082): two integration tests
(TestSysctlCollector_ExcludesNet, TestSysctlCollector_MixedContent) also
call collector.Collect(ctx) which reads /proc/sys but were missing the
skipIfNotLinux guard. The collector degrades via slog.Warn on missing
files rather than returning os.ErrNotExist, so the existing error-skip
branch doesn't fire on macOS/Windows and the assertions pass loosely
rather than skipping. Apply the same helper used by the other Linux-only
integration tests in this package.
@coderabbitai

This comment was marked as resolved.

mchmarny and others added 2 commits May 28, 2026 07:41
CI feedback (#1082): the previous "skip injection when overlay set the
path" policy preserved bcm.yaml's `controller.tolerations` but starved
the controller of CLI tolerations needed to schedule on KWOK system
nodes (`kwok.x-k8s.io/node=fake:NoSchedule`). KWOK Tier 1 bcm/bcm-helm
jobs failed with "1 pods could not be scheduled to nodes".

Replace the binary "authoritative or not" flag with a per-path policy:

  - opt-out  : overlay sets the path to an empty list/map/nil → skip
               injection entirely (kind.yaml's
               `daemonsets.tolerations: []` keeps gpu-operator off the
               control-plane via the chart default).
  - append   : overlay or --set sets the path to a non-empty value →
               UNION the overlay value with CLI tolerations
               (bcm.yaml's BCM-master + CLI's kwok-fake).
  - replace  : default (no overlay, may have values-file value) →
               existing REPLACE semantic preserved, so the NFD
               shared-path system → accelerated overwrite for
               `worker.tolerations` still produces "accelerated wins".

New helpers:
  - component.AppendTolerationsOverrides(values, tols, paths…)
  - bundler.classifySchedulingPaths(overrides, setOverrides, paths)
  - bundler.schedulingPathPolicy{optOut, appendMode}

Verified end-to-end:
  - kind: `daemonsets.tolerations: []` preserved
  - bcm: controller.tolerations = [BCM-master, BCM-control-plane,
         kwok-fake] (UNION with --system-node-toleration)
  - eks/cuj1: NFD worker.tolerations REPLACE-overwritten by accelerated
              CLI value (cuj1-training contract preserved)
  - kueue-style values-file default: still overwritten by CLI (reviewer
    concern: --system-node-toleration never silently becomes a no-op)

Test coverage:
  - TestApplyNodeSchedulingOverrides_RespectsRecipeSetPaths: opt-out /
    append / replace / values-file matrix
  - TestClassifySchedulingPaths: policy derivation
  - TestAppendTolerationsOverrides: append helper behavior including
    non-slice existing value
@github-actions github-actions Bot added size/XL and removed size/L labels May 28, 2026
@coderabbitai

This comment was marked as resolved.

@mchmarny mchmarny enabled auto-merge (squash) May 28, 2026 15:48

@dims dims left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mchmarny mchmarny merged commit 3eab313 into main May 28, 2026
206 of 208 checks passed
@mchmarny mchmarny deleted the fix/kind-overlay-gpu-operator-affinity branch May 28, 2026 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: nvkind gpu-operator values set unsupported devicePlugin.affinity, allowing device plugin to schedule on control-plane

3 participants