Skip to content

feat(recipe): add NFD as standalone shared component#518

Merged
mchmarny merged 2 commits into
NVIDIA:mainfrom
ArangoGutierrez:feat/nfd-standalone-component
Apr 23, 2026
Merged

feat(recipe): add NFD as standalone shared component#518
mchmarny merged 2 commits into
NVIDIA:mainfrom
ArangoGutierrez:feat/nfd-standalone-component

Conversation

@ArangoGutierrez

@ArangoGutierrez ArangoGutierrez commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Extract Node Feature Discovery from gpu-operator's sub-chart into a standalone shared component
  • Both gpu-operator and network-operator now depend on this shared NFD instance via dependencyRefs
  • Network-operator gains NIC labeling via NodeFeatureRule CR (deployNodeFeatureRules: true)
  • NFD becomes a base-level component — every recipe inherits it automatically

Changes

Area Change
recipes/registry.yaml Add nfd component with helm config, healthCheck, nodeScheduling; remove NFD paths from gpu-operator
recipes/components/nfd/values.yaml New file — Master + Worker + GC enabled, TopologyUpdater off
recipes/overlays/base.yaml Add nfd componentRef before cert-manager
recipes/overlays/*.yaml Add nfd to gpu-operator/network-operator dependencyRefs on every overlay that re-declares them (including rtx-pro-6000-lke-{inference,training})
recipes/components/gpu-operator/values.yaml nfd.enabled: false (disable sub-chart)
recipes/components/network-operator/values.yaml nfd.deployNodeFeatureRules: true
recipes/overlays/kind.yaml Remove nfd.enabled: true override (no longer needed)
recipes/checks/nfd/health-check.yaml Chainsaw health check: Master Deployment + Worker DaemonSet + GC Deployment + pod phases

Design

Follows the existing cert-manager pattern — purely declarative, no Go code changes. NFD deploys to node-feature-discovery namespace. Worker DaemonSet runs on all nodes (excluded from accelerated nodeSelector) so it discovers hardware features everywhere.

Resolved deployment order (from aicr query --selector deploymentOrder): cert-manager → kube-prometheus-stack → k8s-ephemeral-storage-metrics → nfd → gpu-operator → kai-scheduler → … nfd and cert-manager are independent components; the dependencyRefs on gpu-operator/network-operator only require nfd to be up before those consumers start.

Verification

  • make test — all packages pass with race detector (72.5% coverage)
  • make lint (golangci-lint v2.10.1) — 0 Go issues
  • make build — builds successfully
  • Recipe resolution: NFD appears in all resolved recipes
  • aicr query confirms gpu-operator.values.nfd.enabled: false
  • aicr query confirms network-operator.values.nfd.deployNodeFeatureRules: true
  • aicr query confirms nfd ∈ gpu-operator.dependencyRefs for rtx-pro-6000-lke-{inference,training}
  • gpu-operator and network-operator both have nfd in dependencyRefs

Related

Track B of NFD Adoption (Track A — NFD snapshot enrichment — completed 2026-04-08).

@ArangoGutierrez ArangoGutierrez requested a review from a team as a code owner April 9, 2026 17:36
@ArangoGutierrez ArangoGutierrez requested a review from a team as a code owner April 9, 2026 17:48
@mchmarny mchmarny added this to the v0.12 milestone Apr 14, 2026

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, declarative PR that follows the cert-manager shared-component pattern. No Go code changes needed — the registry + overlay system handles everything. Good separation of concerns.

Issues to address:

  1. Missing overlaysrtx-pro-6000-lke-inference.yaml and rtx-pro-6000-lke-training.yaml re-declare gpu-operator's dependencyRefs without nfd. If overlays replace rather than merge, gpu-operator has no ordering dependency on nfd in those recipes.

  2. Health check missing GC — The values enable gc.enable: true and the registry configures GC scheduling paths, but the health check only validates Master and Worker. The old gpu-operator assertions checked GC — this is a coverage regression.

Minor:

  • PR description says "nfd → cert-manager" install order, but the actual computed deployment order (visible in test assertions) places nfd after cert-manager. Not a bug (they're independent), but the description is misleading.

Looks good:

  • Worker excluded from accelerated nodeSelectorPaths so it runs everywhere — correct and well-commented
  • nfd.enabled: false in gpu-operator + deployNodeFeatureRules: true in network-operator is the right combination
  • kind overlay cleanup (removing nfd.enabled: true override) is correct
  • Old NFD assertions removed from assert-gpu-operator.yaml since they'd check the wrong namespace now
  • Registry entry is well-structured with proper nodeScheduling paths

Comment thread recipes/overlays/base.yaml
Comment thread recipes/checks/nfd/health-check.yaml Outdated
Comment thread recipes/registry.yaml
Comment thread recipes/components/gpu-operator/values.yaml
Comment thread recipes/components/nfd/values.yaml
Comment thread recipes/components/network-operator/values.yaml
Comment thread recipes/overlays/kind.yaml
Comment thread tests/chainsaw/ai-conformance/offline/assert-recipe.yaml
@ArangoGutierrez ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from 55eb6c4 to 9892d40 Compare April 21, 2026 20:46
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 21, 2026
…ke overlays

Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 21, 2026
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@coderabbitai

This comment was marked as resolved.

@ArangoGutierrez ArangoGutierrez left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mchmarny thanks for the thorough review. Rebased on latest main and addressed all three items:

1. Missing nfd in dependencyRefs — fixed in 4baeb50
Added nfd to gpu-operator.dependencyRefs in both rtx-pro-6000-lke-inference.yaml and rtx-pro-6000-lke-training.yaml. I also audited every overlay that declares dependencyRefs: on gpu-operator or network-operator and confirmed those two were the only regressions — the rest already include nfd. Verified end-to-end:

$ aicr query --service lke --accelerator rtx-pro-6000 --intent training \
    --selector 'components.gpu-operator.dependencyRefs'
- nfd
- cert-manager
- kube-prometheus-stack

2. Health check missing GC — fixed in 9892d40
Added a validate-gc-deployment step asserting node-feature-discovery-gc has availableReplicas > 0, matching the coverage the old gpu-operator assertions had.

3. PR description install order — fixed
Updated the PR body. The actual resolved deploymentOrder puts nfd after cert-manager; since they're independent (no dependencyRefs between them) that's cosmetic ordering, but the description was misleading and now reflects reality.

The branch also dropped three stale Merge branch 'main' commits during the rebase so the history is now two feature commits on top of the two review-response fixes.

coderabbitai[bot]

This comment was marked as resolved.

ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 22, 2026
…ke overlays

Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 22, 2026
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from 20fc985 to f8c9690 Compare April 22, 2026 06:35
coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 22, 2026
…ke overlays

Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 22, 2026
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from 1c0aef8 to d4892b6 Compare April 22, 2026 14:41
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 22, 2026
…ke overlays

Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 22, 2026
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from d4892b6 to f3e9058 Compare April 22, 2026 19:41
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 23, 2026
…ke overlays

Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 23, 2026
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from f3e9058 to d10dcca Compare April 23, 2026 10:26
@mchmarny mchmarny removed this from the v0.12 milestone Apr 23, 2026
@mchmarny mchmarny added this to the v0.12 milestone Apr 23, 2026
Extract Node Feature Discovery from gpu-operator's sub-chart and deploy
it as a top-level component. Both gpu-operator and network-operator
declare dependencyRefs: [nfd] so install ordering is explicit, and every
recipe that includes either operator now consumes a single shared NFD
instance instead of operator-embedded sub-charts. Follows the
cert-manager shared-component pattern.

Registry + values:
  - New top-level entry in recipes/registry.yaml with nodeScheduling
    paths for master/gc (system nodes) and worker (all nodes via both
    system AND accelerated tolerations so it lands on tainted nodes
    regardless of role).
  - recipes/components/nfd/values.yaml pins master.enable and
    worker.enable explicitly so behavior doesn't drift with upstream
    chart defaults; GC enabled; Topology Updater off by default.
  - recipes/components/gpu-operator/values.yaml: nfd.enabled: false.
  - recipes/components/network-operator/values.yaml:
    nfd.deployNodeFeatureRules: true so its NFD rules ship to the
    shared instance.

Overlays:
  - recipes/overlays/base.yaml wires nfd into every recipe.
  - recipes/overlays/rtx-pro-6000-lke-{training,inference}.yaml
    explicitly add nfd to gpu-operator.dependencyRefs where it was
    re-declared (overlay arrays replace, not merge) so ordering is
    preserved.
  - recipes/overlays/kind.yaml drops the no-longer-needed
    nfd.enabled: true override on gpu-operator.

Health check (recipes/checks/nfd/health-check.yaml):
  - Master Deployment ≥1 ready replica.
  - Worker DaemonSet fully rolled out — desiredNumberScheduled > 0 AND
    numberUnavailable == 0 (replaces the weaker numberReady > 0 which
    passed on partial failures).
  - GC Deployment ≥1 ready replica.
  - Pod phase errors on Pending/Failed/Unknown.
  - Container waiting-reason errors for CrashLoopBackOff,
    ImagePullBackOff, ErrImagePull, CreateContainerConfigError — catches
    pods stuck in Running-phase with unhealthy containers.

Tests:
  - tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling-nfd.yaml:
    companion CUJ1 fixture asserts master/GC on system nodes, worker
    with accelerated toleration, and worker.nodeSelector absent as an
    explicit regression guard.
  - tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml: new step wires
    the standalone-NFD scheduling assertion into CUJ1-training.
  - tests/chainsaw/ai-conformance: NFD assertions moved out of
    gpu-operator sub-chart scope.

Design: docs/plans/2026-04-09-nfd-standalone-component-design.md
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from d10dcca to 1b26cf5 Compare April 23, 2026 12:31
@mchmarny mchmarny enabled auto-merge (squash) April 23, 2026 13:11
@mchmarny mchmarny merged commit 7f91140 into NVIDIA:main Apr 23, 2026
77 of 78 checks passed
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 28, 2026
Adds an nfd componentRefs entry to the 12 production GPU leaf overlays
(H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) flipping
overrides.topologyUpdater.enable to true. Each cluster running these
recipes now publishes per-node NodeResourceTopology (NRT) CRDs
describing NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity
— the fact base downstream NUMA-aware schedulers and recipe
auto-resolution flows can consume.

Overlay change: each leaf gains a fresh componentRefs entry with
name=nfd, type=Helm, and overrides.topologyUpdater.enable=true.
mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the
overrides onto the base nfd entry from recipes/overlays/base.yaml
without replacing inherited source/version/valuesFile.

Chart configuration (recipes/components/nfd/values.yaml):

* topologyUpdater.createCRDs: true ensures the noderesourcetopologies
  CRD is installed by the chart on overlays that flip enable=true.
  Managed K8s control planes (EKS / AKS / GKE / OKE / LKE) do not
  preinstall it. The chart guards the CRD template on
  (enable && createCRDs), so this is a no-op when TU is off.
* topologyUpdater.kubeletStateDir: "" disables the broad
  /var/lib/kubelet hostPath mount. TU only needs the pod-resources
  gRPC socket, which is mounted via a dedicated hostPath; the empty
  state-dir stops the chart from exposing kubelet state files (TLS
  material, pod manifests, checkpoint data) into the TU container.

Scheduling (recipes/registry.yaml): nfd.nodeScheduling.{system,
accelerated}.tolerationPaths now include topologyUpdater.tolerations
so the bundler injects --accelerated-node-toleration values into TU
pods. Without this, on every targeted GPU cluster (which carry
nvidia.com/gpu=present:NoSchedule), the TU DaemonSet would have been
unschedulable. nodeSelector deliberately not added — TU runs on all
nodes (same rationale as worker), and topology data on system nodes
is needed for cross-NUMA scheduling decisions.

Kind-chain overlays (h100-kind-*, kind-*) are intentionally excluded:
KWOK and kind clusters lack a real kubelet pod-resources gRPC socket,
so TU would CrashLoopBackOff. The KubeletPodResources feature gate
has been Beta-default since Kubernetes 1.15 and reached GA in 1.28
(KEP-606); AICR's affected leaves require K8s >= 1.30, so the
prerequisite is satisfied in practice.

Coverage:

* New chainsaw step assert-bundle-topology-nfd in CUJ1-training
  asserts the rendered NFD bundle contains topologyUpdater.enable=
  true, createCRDs=true, the GPU-taint toleration, master.enable=
  true, gc.enable=true, and enableNodeFeatureApi=true.
* New TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/
  metadata_test.go covers all 12 GPU platform+intent overlays
  (expected ON) and both kind-chain leaves (expected OFF).
  Type-assertion failures on ON cases promote to t.Fatal so a
  malformed override produces a loud regression rather than a
  silent false-negative.
* docs/user/component-catalog.md gains the missing NFD row and a
  Topology Updater section documenting which recipes run TU and
  the kubelet pod-resources prerequisite.

Closes the local NFD adoption initiative (Tracks A and B previously
shipped via PRs NVIDIA#482, NVIDIA#494, NVIDIA#495, NVIDIA#502, NVIDIA#511, NVIDIA#518, NVIDIA#688). Track C
(recipe auto-resolution from NRT data) and scheduler-integration
work remain open for future PRs.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants