feat(recipe): add NFD as standalone shared component#518
Conversation
mchmarny
left a comment
There was a problem hiding this comment.
Clean, declarative PR that follows the cert-manager shared-component pattern. No Go code changes needed — the registry + overlay system handles everything. Good separation of concerns.
Issues to address:
-
Missing overlays —
rtx-pro-6000-lke-inference.yamlandrtx-pro-6000-lke-training.yamlre-declare gpu-operator'sdependencyRefswithoutnfd. If overlays replace rather than merge, gpu-operator has no ordering dependency on nfd in those recipes. -
Health check missing GC — The values enable
gc.enable: trueand the registry configures GC scheduling paths, but the health check only validates Master and Worker. The old gpu-operator assertions checked GC — this is a coverage regression.
Minor:
- PR description says "nfd → cert-manager" install order, but the actual computed deployment order (visible in test assertions) places nfd after cert-manager. Not a bug (they're independent), but the description is misleading.
Looks good:
- Worker excluded from accelerated
nodeSelectorPathsso it runs everywhere — correct and well-commented nfd.enabled: falsein gpu-operator +deployNodeFeatureRules: truein network-operator is the right combination- kind overlay cleanup (removing
nfd.enabled: trueoverride) is correct - Old NFD assertions removed from
assert-gpu-operator.yamlsince they'd check the wrong namespace now - Registry entry is well-structured with proper nodeScheduling paths
55eb6c4 to
9892d40
Compare
…ke overlays Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
This comment was marked as resolved.
This comment was marked as resolved.
ArangoGutierrez
left a comment
There was a problem hiding this comment.
@mchmarny thanks for the thorough review. Rebased on latest main and addressed all three items:
1. Missing nfd in dependencyRefs — fixed in 4baeb50
Added nfd to gpu-operator.dependencyRefs in both rtx-pro-6000-lke-inference.yaml and rtx-pro-6000-lke-training.yaml. I also audited every overlay that declares dependencyRefs: on gpu-operator or network-operator and confirmed those two were the only regressions — the rest already include nfd. Verified end-to-end:
$ aicr query --service lke --accelerator rtx-pro-6000 --intent training \
--selector 'components.gpu-operator.dependencyRefs'
- nfd
- cert-manager
- kube-prometheus-stack2. Health check missing GC — fixed in 9892d40
Added a validate-gc-deployment step asserting node-feature-discovery-gc has availableReplicas > 0, matching the coverage the old gpu-operator assertions had.
3. PR description install order — fixed
Updated the PR body. The actual resolved deploymentOrder puts nfd after cert-manager; since they're independent (no dependencyRefs between them) that's cosmetic ordering, but the description was misleading and now reflects reality.
The branch also dropped three stale Merge branch 'main' commits during the rebase so the history is now two feature commits on top of the two review-response fixes.
…ke overlays Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
20fc985 to
f8c9690
Compare
…ke overlays Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
1c0aef8 to
d4892b6
Compare
…ke overlays Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
d4892b6 to
f3e9058
Compare
…ke overlays Overlay arrays replace rather than merge, so these two overlays that re-declare gpu-operator.dependencyRefs were losing the nfd ordering dependency introduced by the standalone NFD component. Addresses @mchmarny review comment NVIDIA#1 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
values.yaml sets gc.enable: true and registry.yaml configures GC nodeScheduling paths, but the health check only validated Master and Worker. Adds an assertion on the node-feature-discovery-gc Deployment so a broken GC is caught by the conformance run. Addresses @mchmarny review comment NVIDIA#2 on NVIDIA#518. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
f3e9058 to
d10dcca
Compare
Extract Node Feature Discovery from gpu-operator's sub-chart and deploy
it as a top-level component. Both gpu-operator and network-operator
declare dependencyRefs: [nfd] so install ordering is explicit, and every
recipe that includes either operator now consumes a single shared NFD
instance instead of operator-embedded sub-charts. Follows the
cert-manager shared-component pattern.
Registry + values:
- New top-level entry in recipes/registry.yaml with nodeScheduling
paths for master/gc (system nodes) and worker (all nodes via both
system AND accelerated tolerations so it lands on tainted nodes
regardless of role).
- recipes/components/nfd/values.yaml pins master.enable and
worker.enable explicitly so behavior doesn't drift with upstream
chart defaults; GC enabled; Topology Updater off by default.
- recipes/components/gpu-operator/values.yaml: nfd.enabled: false.
- recipes/components/network-operator/values.yaml:
nfd.deployNodeFeatureRules: true so its NFD rules ship to the
shared instance.
Overlays:
- recipes/overlays/base.yaml wires nfd into every recipe.
- recipes/overlays/rtx-pro-6000-lke-{training,inference}.yaml
explicitly add nfd to gpu-operator.dependencyRefs where it was
re-declared (overlay arrays replace, not merge) so ordering is
preserved.
- recipes/overlays/kind.yaml drops the no-longer-needed
nfd.enabled: true override on gpu-operator.
Health check (recipes/checks/nfd/health-check.yaml):
- Master Deployment ≥1 ready replica.
- Worker DaemonSet fully rolled out — desiredNumberScheduled > 0 AND
numberUnavailable == 0 (replaces the weaker numberReady > 0 which
passed on partial failures).
- GC Deployment ≥1 ready replica.
- Pod phase errors on Pending/Failed/Unknown.
- Container waiting-reason errors for CrashLoopBackOff,
ImagePullBackOff, ErrImagePull, CreateContainerConfigError — catches
pods stuck in Running-phase with unhealthy containers.
Tests:
- tests/chainsaw/cli/cuj1-training/assert-bundle-scheduling-nfd.yaml:
companion CUJ1 fixture asserts master/GC on system nodes, worker
with accelerated toleration, and worker.nodeSelector absent as an
explicit regression guard.
- tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml: new step wires
the standalone-NFD scheduling assertion into CUJ1-training.
- tests/chainsaw/ai-conformance: NFD assertions moved out of
gpu-operator sub-chart scope.
Design: docs/plans/2026-04-09-nfd-standalone-component-design.md
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
d10dcca to
1b26cf5
Compare
Adds an nfd componentRefs entry to the 12 production GPU leaf overlays
(H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) flipping
overrides.topologyUpdater.enable to true. Each cluster running these
recipes now publishes per-node NodeResourceTopology (NRT) CRDs
describing NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity
— the fact base downstream NUMA-aware schedulers and recipe
auto-resolution flows can consume.
Overlay change: each leaf gains a fresh componentRefs entry with
name=nfd, type=Helm, and overrides.topologyUpdater.enable=true.
mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the
overrides onto the base nfd entry from recipes/overlays/base.yaml
without replacing inherited source/version/valuesFile.
Chart configuration (recipes/components/nfd/values.yaml):
* topologyUpdater.createCRDs: true ensures the noderesourcetopologies
CRD is installed by the chart on overlays that flip enable=true.
Managed K8s control planes (EKS / AKS / GKE / OKE / LKE) do not
preinstall it. The chart guards the CRD template on
(enable && createCRDs), so this is a no-op when TU is off.
* topologyUpdater.kubeletStateDir: "" disables the broad
/var/lib/kubelet hostPath mount. TU only needs the pod-resources
gRPC socket, which is mounted via a dedicated hostPath; the empty
state-dir stops the chart from exposing kubelet state files (TLS
material, pod manifests, checkpoint data) into the TU container.
Scheduling (recipes/registry.yaml): nfd.nodeScheduling.{system,
accelerated}.tolerationPaths now include topologyUpdater.tolerations
so the bundler injects --accelerated-node-toleration values into TU
pods. Without this, on every targeted GPU cluster (which carry
nvidia.com/gpu=present:NoSchedule), the TU DaemonSet would have been
unschedulable. nodeSelector deliberately not added — TU runs on all
nodes (same rationale as worker), and topology data on system nodes
is needed for cross-NUMA scheduling decisions.
Kind-chain overlays (h100-kind-*, kind-*) are intentionally excluded:
KWOK and kind clusters lack a real kubelet pod-resources gRPC socket,
so TU would CrashLoopBackOff. The KubeletPodResources feature gate
has been Beta-default since Kubernetes 1.15 and reached GA in 1.28
(KEP-606); AICR's affected leaves require K8s >= 1.30, so the
prerequisite is satisfied in practice.
Coverage:
* New chainsaw step assert-bundle-topology-nfd in CUJ1-training
asserts the rendered NFD bundle contains topologyUpdater.enable=
true, createCRDs=true, the GPU-taint toleration, master.enable=
true, gc.enable=true, and enableNodeFeatureApi=true.
* New TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/
metadata_test.go covers all 12 GPU platform+intent overlays
(expected ON) and both kind-chain leaves (expected OFF).
Type-assertion failures on ON cases promote to t.Fatal so a
malformed override produces a loud regression rather than a
silent false-negative.
* docs/user/component-catalog.md gains the missing NFD row and a
Topology Updater section documenting which recipes run TU and
the kubelet pod-resources prerequisite.
Closes the local NFD adoption initiative (Tracks A and B previously
shipped via PRs NVIDIA#482, NVIDIA#494, NVIDIA#495, NVIDIA#502, NVIDIA#511, NVIDIA#518, NVIDIA#688). Track C
(recipe auto-resolution from NRT data) and scheduler-integration
work remain open for future PRs.
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Summary
dependencyRefsNodeFeatureRuleCR (deployNodeFeatureRules: true)Changes
recipes/registry.yamlnfdcomponent with helm config, healthCheck, nodeScheduling; remove NFD paths from gpu-operatorrecipes/components/nfd/values.yamlrecipes/overlays/base.yamlnfdcomponentRef before cert-managerrecipes/overlays/*.yamlnfdto gpu-operator/network-operatordependencyRefson every overlay that re-declares them (includingrtx-pro-6000-lke-{inference,training})recipes/components/gpu-operator/values.yamlnfd.enabled: false(disable sub-chart)recipes/components/network-operator/values.yamlnfd.deployNodeFeatureRules: truerecipes/overlays/kind.yamlnfd.enabled: trueoverride (no longer needed)recipes/checks/nfd/health-check.yamlDesign
Follows the existing cert-manager pattern — purely declarative, no Go code changes. NFD deploys to
node-feature-discoverynamespace. Worker DaemonSet runs on all nodes (excluded from accelerated nodeSelector) so it discovers hardware features everywhere.Resolved deployment order (from
aicr query --selector deploymentOrder): cert-manager → kube-prometheus-stack → k8s-ephemeral-storage-metrics → nfd → gpu-operator → kai-scheduler → … nfd and cert-manager are independent components; thedependencyRefson gpu-operator/network-operator only require nfd to be up before those consumers start.Verification
make test— all packages pass with race detector (72.5% coverage)make lint(golangci-lint v2.10.1) — 0 Go issuesmake build— builds successfullyaicr queryconfirmsgpu-operator.values.nfd.enabled: falseaicr queryconfirmsnetwork-operator.values.nfd.deployNodeFeatureRules: trueaicr queryconfirmsnfd ∈ gpu-operator.dependencyRefsforrtx-pro-6000-lke-{inference,training}nfdin dependencyRefsRelated
Track B of NFD Adoption (Track A — NFD snapshot enrichment — completed 2026-04-08).