Skip to content

feat(recipes): enable NFD Topology Updater on production GPU recipes#711

Merged
mchmarny merged 6 commits into
NVIDIA:mainfrom
ArangoGutierrez:feat/nfd-topology-updater
Apr 30, 2026
Merged

feat(recipes): enable NFD Topology Updater on production GPU recipes#711
mchmarny merged 6 commits into
NVIDIA:mainfrom
ArangoGutierrez:feat/nfd-topology-updater

Conversation

@ArangoGutierrez

Copy link
Copy Markdown
Contributor

Summary

Enables the NFD Topology Updater on the 12 production GPU leaf overlays so every shipped GPU recipe emits per-node NodeResourceTopology CRDs describing NUMA zones and GPU/NIC affinity. Closes the NFD adoption initiative; Tracks A and B already shipped (#482, #494, #495, #502, #511, #518, #688).

Motivation / Context

After Track A (snapshot enrichment) and Track B (NFD as standalone shared component) merged, NFD is in place but the Topology Updater is off by default — no cluster running an AICR-generated GPU recipe has NodeResourceTopology (NRT) data today. NRT is the fact base future NUMA-aware schedulers (KAI, Volcano, kube-scheduler topology-aware plugin) and recipe auto-resolution work will consume. This PR turns it on for the production GPU leaves while keeping kind-chain overlays unaffected (KWOK clusters lack the kubelet podResources gRPC socket NFD TU requires).

Fixes: N/A
Related: #482, #494, #495, #502, #511, #518, #688

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update

Component(s) Affected

  • Recipe engine / data (pkg/recipe) — table-driven regression test in pkg/recipe/metadata_test.go
  • Docs/examples (docs/, examples/) — docs/user/component-catalog.md adds NFD entry + Topology Updater note
  • Other: recipes/overlays/* (12 overlays) and tests/chainsaw/cli/cuj1-training/* (positive assertion fixture)

Implementation Notes

Override mechanism. Each of the 12 GPU leaves gains a fresh componentRefs entry with only name=nfd, type=Helm, and overrides.topologyUpdater.enable=true. mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the overlay's overrides map onto the base nfd entry from recipes/overlays/base.yaml without replacing inherited source/version/valuesFile — so the chart pin lives in one place.

Kind chain excluded. kind.yaml, h100-kind-*.yaml, and kind-*.yaml are intentionally not edited. KWOK and kind clusters lack a real kubelet, so podResources is unavailable and TU would CrashLoopBackOff. The Topology Updater requires KubeletPodResources, which reached GA in Kubernetes 1.27.

Coverage in two layers. A new chainsaw step assert-bundle-topology-nfd in CUJ1-training asserts the rendered NFD bundle contains topologyUpdater.enable: true end-to-end. A new table-driven Go test TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/metadata_test.go covers all 12 GPU leaves (expected ON) and both kind-chain leaves (expected OFF) deterministically.

Pre-existing test restructure. The Go test commit also collapses two pre-existing table-driven tests (TestRecipeMetadataSpecValidateDependencies, TestRecipeMetadataSpecMerge) from the outer-loop pattern into the inner-run-closure pattern to satisfy the tdd-guard 50-line window. Logic and test cases preserved byte-for-byte; verified via diff.

No Go business-logic changes, no schema changes. Pure declarative YAML override + new test + new doc paragraph.

Testing

make qualify
  • All 18 chainsaw tests PASS, including the new assert-bundle-topology-nfd step.
  • All pkg/recipe/... and pkg/bundler/... Go tests PASS with -race.
  • make lint clean (golangci-lint + yamllint + license headers + AGENTS.md sync + docs sidebar).
  • aicr query confirms hydration: all 12 GPU leaves return topologyUpdater.enable=true; both kind leaves return false.
  • TestNFDTopologyUpdater_OverlayCoverage: 14/14 sub-tests PASS (12 ON, 2 OFF).

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Pure additive YAML on production GPU recipes. Worst case: a managed-K8s cloud provider gates podResources and the TU DaemonSet goes non-Ready — failure is loud and self-observable, no silent degradation. NRT pod cost <50 MiB/node. No migration needed; the override only takes effect on next bundle/deploy cycle.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai

This comment was marked as resolved.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closes the NFD adoption initiative cleanly. The 12 overlay edits are uniform (name=nfd, type=Helm, overrides.topologyUpdater.enable=true) and rely on the existing mergeComponentRef deep-merge so the chart pin stays in base.yaml — that's the right shape. Kind chain is intentionally excluded with the correct rationale (no kubelet podResources socket).

CI is green across the full KWOK Tier 2 matrix (all 12 production leaves), the new Go regression test, and the new chainsaw assertion for CUJ1. The pre-existing test restructure (closure pattern for tdd-guard 50-line window) is mechanical and preserves logic.

Two nits inline — both non-blocking. The Go test only covers the 12 directly-edited intent-level overlays, not the deeper specialized leaves users actually resolve to; worth a follow-up to broaden the criteria matrix. PR is in draft so leaving as comment.

Comment thread pkg/recipe/metadata_test.go
Comment thread tests/chainsaw/cli/cuj1-training/chainsaw-test.yaml
Adds an nfd componentRefs entry to the 12 production GPU leaf overlays
(H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) flipping
overrides.topologyUpdater.enable to true. Each cluster running these
recipes now publishes per-node NodeResourceTopology (NRT) CRDs
describing NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity
— the fact base downstream NUMA-aware schedulers and recipe
auto-resolution flows can consume.

Overlay change: each leaf gains a fresh componentRefs entry with
name=nfd, type=Helm, and overrides.topologyUpdater.enable=true.
mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the
overrides onto the base nfd entry from recipes/overlays/base.yaml
without replacing inherited source/version/valuesFile.

Chart configuration (recipes/components/nfd/values.yaml):

* topologyUpdater.createCRDs: true ensures the noderesourcetopologies
  CRD is installed by the chart on overlays that flip enable=true.
  Managed K8s control planes (EKS / AKS / GKE / OKE / LKE) do not
  preinstall it. The chart guards the CRD template on
  (enable && createCRDs), so this is a no-op when TU is off.
* topologyUpdater.kubeletStateDir: "" disables the broad
  /var/lib/kubelet hostPath mount. TU only needs the pod-resources
  gRPC socket, which is mounted via a dedicated hostPath; the empty
  state-dir stops the chart from exposing kubelet state files (TLS
  material, pod manifests, checkpoint data) into the TU container.

Scheduling (recipes/registry.yaml): nfd.nodeScheduling.{system,
accelerated}.tolerationPaths now include topologyUpdater.tolerations
so the bundler injects --accelerated-node-toleration values into TU
pods. Without this, on every targeted GPU cluster (which carry
nvidia.com/gpu=present:NoSchedule), the TU DaemonSet would have been
unschedulable. nodeSelector deliberately not added — TU runs on all
nodes (same rationale as worker), and topology data on system nodes
is needed for cross-NUMA scheduling decisions.

Kind-chain overlays (h100-kind-*, kind-*) are intentionally excluded:
KWOK and kind clusters lack a real kubelet pod-resources gRPC socket,
so TU would CrashLoopBackOff. The KubeletPodResources feature gate
has been Beta-default since Kubernetes 1.15 and reached GA in 1.28
(KEP-606); AICR's affected leaves require K8s >= 1.30, so the
prerequisite is satisfied in practice.

Coverage:

* New chainsaw step assert-bundle-topology-nfd in CUJ1-training
  asserts the rendered NFD bundle contains topologyUpdater.enable=
  true, createCRDs=true, the GPU-taint toleration, master.enable=
  true, gc.enable=true, and enableNodeFeatureApi=true.
* New TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/
  metadata_test.go covers all 12 GPU platform+intent overlays
  (expected ON) and both kind-chain leaves (expected OFF).
  Type-assertion failures on ON cases promote to t.Fatal so a
  malformed override produces a loud regression rather than a
  silent false-negative.
* docs/user/component-catalog.md gains the missing NFD row and a
  Topology Updater section documenting which recipes run TU and
  the kubelet pod-resources prerequisite.

Closes the local NFD adoption initiative (Tracks A and B previously
shipped via PRs NVIDIA#482, NVIDIA#494, NVIDIA#495, NVIDIA#502, NVIDIA#511, NVIDIA#518, NVIDIA#688). Track C
(recipe auto-resolution from NRT data) and scheduler-integration
work remain open for future PRs.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez ArangoGutierrez force-pushed the feat/nfd-topology-updater branch from 0f2d778 to 9ff43f8 Compare April 28, 2026 21:51
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review April 28, 2026 21:52
@ArangoGutierrez ArangoGutierrez requested review from a team as code owners April 28, 2026 21:52
coderabbitai[bot]

This comment was marked as resolved.

Address two reviewer nits on PR NVIDIA#711:

* mchmarny: TestNFDTopologyUpdater_OverlayCoverage only covered the
  12 platform+intent overlays directly edited by the feature commit.
  Production users actually resolve to deeper specialized leaves
  (Ubuntu, Kubeflow, Dynamo, NIM, COS variants). A future overlay
  edit that replaced (rather than deep-merged) componentRefs at the
  deeper layer would have slipped through. Expand the table from 14
  to 37 rows: add 21 deeper real-cluster leaves (all expected ON via
  base-chain inheritance) plus 2 deeper kind leaves
  (h100-kind-training-kubeflow, h100-kind-inference-dynamo, both
  expected OFF). Add a 'platform' field to the local criteria struct
  to drive Kubeflow/Dynamo/NIM resolution.

* coderabbitai: the wantTUOn=false branch passed silently when
  topologyUpdater.enable was present with wrong type (e.g. quoted
  string "true"), which Helm can still evaluate truthy. Tighten the
  OFF-path: if topologyUpdater is present, a non-bool enable now
  produces t.Fatal (loud regression) and a bool true produces
  t.Errorf with the criteria details.

No production code changes — test coverage only. Existing 14 rows
still pass; new 23 rows pass; OFF-path now catches the string-true
case the original logic accepted.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
coderabbitai[bot]

This comment was marked as resolved.

ArangoGutierrez and others added 2 commits April 28, 2026 16:00
Per project lint guideline (CLAUDE.md troubleshooting: 'add return
after t.Fatal()'). Surfaced by CodeRabbit on 3a7ad60. Mechanical
change; no behavioral impact.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@mchmarny mchmarny enabled auto-merge (squash) April 30, 2026 13:41
mchmarny
mchmarny previously approved these changes Apr 30, 2026
The assert-bundle-topology-nfd chainsaw step hardcoded
${WORK}/bundle/nfd/values.yaml, which doesn't exist after the
NNN-folder bundle layout introduced by NVIDIA#706. CI fails with
"sed: can't read .../bundle/nfd/values.yaml".

Resolve the path via the same glob the sibling assert-bundle-scheduling-nfd
step uses (`bundle/[0-9][0-9][0-9]-nfd/values.yaml`).

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@mchmarny mchmarny merged commit 639c53e into NVIDIA:main Apr 30, 2026
144 of 146 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants