feat(recipe): add NFD as standalone shared component by ArangoGutierrez · Pull Request #518 · NVIDIA/aicr

ArangoGutierrez · 2026-04-09T17:36:17Z

Summary

Extract Node Feature Discovery from gpu-operator's sub-chart into a standalone shared component
Both gpu-operator and network-operator now depend on this shared NFD instance via dependencyRefs
Network-operator gains NIC labeling via NodeFeatureRule CR (deployNodeFeatureRules: true)
NFD becomes a base-level component — every recipe inherits it automatically

Changes

Area	Change
`recipes/registry.yaml`	Add `nfd` component with helm config, healthCheck, nodeScheduling; remove NFD paths from gpu-operator
`recipes/components/nfd/values.yaml`	New file — Master + Worker + GC enabled, TopologyUpdater off
`recipes/overlays/base.yaml`	Add `nfd` componentRef before cert-manager
`recipes/overlays/*.yaml`	Add `nfd` to gpu-operator/network-operator `dependencyRefs` on every overlay that re-declares them (including `rtx-pro-6000-lke-{inference,training}`)
`recipes/components/gpu-operator/values.yaml`	`nfd.enabled: false` (disable sub-chart)
`recipes/components/network-operator/values.yaml`	`nfd.deployNodeFeatureRules: true`
`recipes/overlays/kind.yaml`	Remove `nfd.enabled: true` override (no longer needed)
`recipes/checks/nfd/health-check.yaml`	Chainsaw health check: Master Deployment + Worker DaemonSet + GC Deployment + pod phases

Design

Follows the existing cert-manager pattern — purely declarative, no Go code changes. NFD deploys to node-feature-discovery namespace. Worker DaemonSet runs on all nodes (excluded from accelerated nodeSelector) so it discovers hardware features everywhere.

Resolved deployment order (from aicr query --selector deploymentOrder): cert-manager → kube-prometheus-stack → k8s-ephemeral-storage-metrics → nfd → gpu-operator → kai-scheduler → … nfd and cert-manager are independent components; the dependencyRefs on gpu-operator/network-operator only require nfd to be up before those consumers start.

Verification

make test — all packages pass with race detector (72.5% coverage)
make lint (golangci-lint v2.10.1) — 0 Go issues
make build — builds successfully
Recipe resolution: NFD appears in all resolved recipes
aicr query confirms gpu-operator.values.nfd.enabled: false
aicr query confirms network-operator.values.nfd.deployNodeFeatureRules: true
aicr query confirms nfd ∈ gpu-operator.dependencyRefs for rtx-pro-6000-lke-{inference,training}
gpu-operator and network-operator both have nfd in dependencyRefs

Missing overlays — rtx-pro-6000-lke-inference.yaml and rtx-pro-6000-lke-training.yaml re-declare gpu-operator's dependencyRefs without nfd. If overlays replace rather than merge, gpu-operator has no ordering dependency on nfd in those recipes.
Health check missing GC — The values enable gc.enable: true and the registry configures GC scheduling paths, but the health check only validates Master and Worker. The old gpu-operator assertions checked GC — this is a coverage regression.

Minor:

PR description says "nfd → cert-manager" install order, but the actual computed deployment order (visible in test assertions) places nfd after cert-manager. Not a bug (they're independent), but the description is misleading.

Looks good:

Worker excluded from accelerated nodeSelectorPaths so it runs everywhere — correct and well-commented
nfd.enabled: false in gpu-operator + deployNodeFeatureRules: true in network-operator is the right combination
kind overlay cleanup (removing nfd.enabled: true override) is correct
Old NFD assertions removed from assert-gpu-operator.yaml since they'd check the wrong namespace now
Registry entry is well-structured with proper nodeScheduling paths

@mchmarny

…ke overlays Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

@mchmarny

values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

ArangoGutierrez

@mchmarny thanks for the thorough review. Rebased on latest main and addressed all three items:

1. Missing nfd in dependencyRefs — fixed in 4baeb50
Added nfd to gpu-operator.dependencyRefs in both rtx-pro-6000-lke-inference.yaml and rtx-pro-6000-lke-training.yaml. I also audited every overlay that declares dependencyRefs: on gpu-operator or network-operator and confirmed those two were the only regressions — the rest already include nfd. Verified end-to-end:

$ aicr query --service lke --accelerator rtx-pro-6000 --intent training \
    --selector 'components.gpu-operator.dependencyRefs'
- nfd
- cert-manager
- kube-prometheus-stack

2. Health check missing GC — fixed in 9892d40
Added a validate-gc-deployment step asserting node-feature-discovery-gc has availableReplicas > 0, matching the coverage the old gpu-operator assertions had.

3. PR description install order — fixed
Updated the PR body. The actual resolved deploymentOrder puts nfd after cert-manager; since they're independent (no dependencyRefs between them) that's cosmetic ordering, but the description was misleading and now reflects reality.

The branch also dropped three stale Merge branch 'main' commits during the rebase so the history is now two feature commits on top of the two review-response fixes.

@mchmarny

…ke overlays Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

@mchmarny

values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

@mchmarny

…ke overlays Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

@mchmarny

values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

@mchmarny

…ke overlays Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

@mchmarny

values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

@mchmarny

…ke overlays Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

@mchmarny

values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

Extract Node Feature Discovery from gpu-operator's sub-chart and deploy it as a top-level component. Both gpu-operator and network-operator declare dependencyRefs: [nfd] so install ordering is explicit, and every recipe that includes either operator now consumes a single shared NFD instance instead of operator-embedded sub-charts. Follows the cert-manager shared-component pattern. Registry + values: - New top-level entry in recipes/registry.yaml with nodeScheduling paths for master/gc (system nodes) and worker (all nodes via both system AND accelerated tolerations so it lands on tainted nodes regardless of role). - recipes/components/nfd/values.yaml pins master.enable and worker.enable explicitly so behavior doesn't drift with upstream chart defaults; GC enabled; Topology Updater off by default. - recipes/components/gpu-operator/values.yaml: nfd.enabled: false. - recipes/components/network-operator/values.yaml: nfd.deployNodeFeatureRules: true so its NFD rules ship to the shared instance. Overlays: - recipes/overlays/base.yaml wires nfd into every recipe. - recipes/overlays/rtx-pro-6000-lke-{training,inference}.yaml explicitly add nfd to gpu-operator.dependencyRefs where it was re-declared (overlay arrays replace, not merge) so ordering is preserved. - recipes/overlays/kind.yaml drops the no-longer-needed nfd.enabled: true override on gpu-operator. Health check (recipes/checks/nfd/health-check.yaml): - Master Deployment ≥1 ready replica. - Worker DaemonSet fully rolled out — desiredNumberScheduled > 0 AND numberUnavailable == 0 (replaces the weaker numberReady > 0 which passed on partial failures). - GC Deployment ≥1 ready replica. - Pod phase errors on Pending/Failed/Unknown. - Container waiting-reason errors for CrashLoopBackOff, ImagePullBackOff, ErrImagePull, CreateContainerConfigError — catches pods stuck in Running-phase with unhealthy containers. Tests: - tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling-nfd.yaml: companion CUJ1 fixture asserts master/GC on system nodes, worker with accelerated toleration, and worker.nodeSelector absent as an explicit regression guard. - tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml: new step wires the standalone-NFD scheduling assertion into CUJ1-training. - tests/chainsaw/ai-conformance: NFD assertions moved out of gpu-operator sub-chart scope. Design: docs/plans/2026-04-09-nfd-standalone-component-design.md Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

Adds an nfd componentRefs entry to the 12 production GPU leaf overlays (H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) flipping overrides.topologyUpdater.enable to true. Each cluster running these recipes now publishes per-node NodeResourceTopology (NRT) CRDs describing NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity — the fact base downstream NUMA-aware schedulers and recipe auto-resolution flows can consume. Overlay change: each leaf gains a fresh componentRefs entry with name=nfd, type=Helm, and overrides.topologyUpdater.enable=true. mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the overrides onto the base nfd entry from recipes/overlays/base.yaml without replacing inherited source/version/valuesFile. Chart configuration (recipes/components/nfd/values.yaml): * topologyUpdater.createCRDs: true ensures the noderesourcetopologies CRD is installed by the chart on overlays that flip enable=true. Managed K8s control planes (EKS / AKS / GKE / OKE / LKE) do not preinstall it. The chart guards the CRD template on (enable && createCRDs), so this is a no-op when TU is off. * topologyUpdater.kubeletStateDir: "" disables the broad /var/lib/kubelet hostPath mount. TU only needs the pod-resources gRPC socket, which is mounted via a dedicated hostPath; the empty state-dir stops the chart from exposing kubelet state files (TLS material, pod manifests, checkpoint data) into the TU container. Scheduling (recipes/registry.yaml): nfd.nodeScheduling.{system, accelerated}.tolerationPaths now include topologyUpdater.tolerations so the bundler injects --accelerated-node-toleration values into TU pods. Without this, on every targeted GPU cluster (which carry nvidia.com/gpu=present:NoSchedule), the TU DaemonSet would have been unschedulable. nodeSelector deliberately not added — TU runs on all nodes (same rationale as worker), and topology data on system nodes is needed for cross-NUMA scheduling decisions. Kind-chain overlays (h100-kind-*, kind-*) are intentionally excluded: KWOK and kind clusters lack a real kubelet pod-resources gRPC socket, so TU would CrashLoopBackOff. The KubeletPodResources feature gate has been Beta-default since Kubernetes 1.15 and reached GA in 1.28 (KEP-606); AICR's affected leaves require K8s >= 1.30, so the prerequisite is satisfied in practice. Coverage: * New chainsaw step assert-bundle-topology-nfd in CUJ1-training asserts the rendered NFD bundle contains topologyUpdater.enable= true, createCRDs=true, the GPU-taint toleration, master.enable= true, gc.enable=true, and enableNodeFeatureApi=true. * New TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/ metadata_test.go covers all 12 GPU platform+intent overlays (expected ON) and both kind-chain leaves (expected OFF). Type-assertion failures on ON cases promote to t.Fatal so a malformed override produces a loud regression rather than a silent false-negative. * docs/user/component-catalog.md gains the missing NFD row and a Topology Updater section documenting which recipes run TU and the kubelet pod-resources prerequisite. Closes the local NFD adoption initiative (Tracks A and B previously shipped via PRs NVIDIA#482, NVIDIA#494, NVIDIA#495, NVIDIA#502, NVIDIA#511, NVIDIA#518, NVIDIA#688). Track C (recipe auto-resolution from NRT data) and scheduler-integration work remain open for future PRs. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

ArangoGutierrez requested a review from a team as a code owner April 9, 2026 17:36

github-actions Bot added area/recipes size/M labels Apr 9, 2026

ArangoGutierrez requested a review from a team as a code owner April 9, 2026 17:48

github-actions Bot added area/tests size/L and removed size/M labels Apr 9, 2026

mchmarny assigned ArangoGutierrez Apr 10, 2026

mchmarny added this to the v0.12 milestone Apr 14, 2026

mchmarny requested changes Apr 14, 2026

View reviewed changes

ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from 55eb6c4 to 9892d40 Compare April 21, 2026 20:46

This comment was marked as resolved.

Sign in to view

ArangoGutierrez commented Apr 21, 2026

View reviewed changes

github-actions Bot added the area/bundler label Apr 22, 2026

This comment was marked as resolved.

Sign in to view

ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from 20fc985 to f8c9690 Compare April 22, 2026 06:35

This comment was marked as resolved.

Sign in to view

ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from 1c0aef8 to d4892b6 Compare April 22, 2026 14:41

ArangoGutierrez requested a review from mchmarny April 22, 2026 19:28

ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from d4892b6 to f3e9058 Compare April 22, 2026 19:41

ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from f3e9058 to d10dcca Compare April 23, 2026 10:26

github-actions Bot removed the area/bundler label Apr 23, 2026

mchmarny removed this from the v0.12 milestone Apr 23, 2026

mchmarny approved these changes Apr 23, 2026

View reviewed changes

mchmarny added this to the v0.12 milestone Apr 23, 2026

ArangoGutierrez force-pushed the feat/nfd-standalone-component branch from d10dcca to 1b26cf5 Compare April 23, 2026 12:31

mchmarny enabled auto-merge (squash) April 23, 2026 13:11

Merge branch 'main' into feat/nfd-standalone-component

3cb3760

mchmarny merged commit 7f91140 into NVIDIA:main Apr 23, 2026
77 of 78 checks passed

ArangoGutierrez mentioned this pull request Apr 28, 2026

feat(recipes): enable NFD Topology Updater on production GPU recipes #711

Merged

13 tasks

coderabbitai Bot mentioned this pull request Jun 8, 2026

feat(validator): ship chainsaw binary; activate deployment-phase runner #1235

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(recipe): add NFD as standalone shared component#518

feat(recipe): add NFD as standalone shared component#518
mchmarny merged 2 commits into
NVIDIA:mainfrom
ArangoGutierrez:feat/nfd-standalone-component

ArangoGutierrez commented Apr 9, 2026 •

edited

Loading

Uh oh!

mchmarny left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as resolved.

ArangoGutierrez left a comment

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ArangoGutierrez commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Design

Verification

Related

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment was marked as resolved.

ArangoGutierrez left a comment

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ArangoGutierrez commented Apr 9, 2026 •

edited

Loading