Skip to content

feat: wire NFDHardwareDetector into production snapshot pipeline#502

Merged
mchmarny merged 6 commits into
NVIDIA:mainfrom
ArangoGutierrez:feat/nfd-integration-wiring
Apr 8, 2026
Merged

feat: wire NFDHardwareDetector into production snapshot pipeline#502
mchmarny merged 6 commits into
NVIDIA:mainfrom
ArangoGutierrez:feat/nfd-integration-wiring

Conversation

@ArangoGutierrez

Copy link
Copy Markdown
Contributor

Summary

Wires NFDHardwareDetector into the GPU collector factory, enabling two-phase GPU detection in production snapshots. This is Wave 2, Task 3 of the NFD Snapshot Enrichment project — the final step that activates all prior NFD infrastructure (#482, #494, #495) in production.

After this change, aicr snapshot automatically runs:

  • Phase 1 (NFD): PCI enumeration via sysfs — detects NVIDIA GPUs without drivers
  • Phase 2 (nvidia-smi): Full driver/hardware telemetry — existing behavior

Key use case: Day-0 GPU detection

On freshly provisioned nodes where NVIDIA drivers are not yet installed, --require-gpu now succeeds if NFD finds GPU hardware via PCI, even though nvidia-smi reports 0 GPUs.

Changes

Commit Description
5461b858 Test covering day-0 scenario (NFD finds 2 GPUs, smi reports 0)
6de48173 --require-gpu error message updated to mention both detection sources
8f3cd099 One-line factory wiring: gpu.WithHardwareDetector(&gpu.NFDHardwareDetector{})
e145131d doc.go rewritten for two-phase collection model

Test Plan

  • go test ./pkg/snapshotter/... -race — all 4 RequireGPU subtests pass (including new day-0 case)
  • go test ./pkg/collector/... -race — factory type assertion test passes
  • go build ./... — clean build
  • make qualify — 72.1% coverage (threshold: 70%), golangci-lint 0 issues
  • macOS graceful degradation: Phase 1 fails gracefully (no sysfs), Phase 2 only

Dependencies

Requires merged PRs:

@ArangoGutierrez

Copy link
Copy Markdown
Contributor Author

cc @mchmarny

Adds test case verifying that --require-gpu succeeds when NFD hardware
detection finds GPUs via PCI enumeration (gpu-count=2) but nvidia-smi
reports 0 GPUs (drivers not installed). This is the primary day-0
use case for NFD integration.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
The error message previously only mentioned nvidia-smi. Now that the
GPU collector runs two-phase detection (NFD PCI + nvidia-smi), the
error message reflects both detection sources.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
CreateGPUCollector now passes NFDHardwareDetector via WithHardwareDetector
option, enabling two-phase GPU collection in production snapshots:
  Phase 1: NFD PCI enumeration (no driver required)
  Phase 2: nvidia-smi telemetry (existing behavior)

On platforms without sysfs (macOS, containers), Phase 1 fails gracefully
and the collector falls back to nvidia-smi-only behavior.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Replaces the nvidia-smi-only documentation with the two-phase detection
model (NFD hardware + nvidia-smi telemetry). Documents graceful
degradation, measurement structure, and platform support.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@yuanchen8911

Copy link
Copy Markdown
Contributor

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Reviewers: Claude Code (5-agent pipeline), Codex, CodeRabbit — all three independently found no actionable issues.

Minor observations (all below threshold, not flagged):

  • verifyGPUCollected comment could mention two-phase semantics explicitly
  • --require-gpu now accepts NFD PCI detection (intentional per PR description)

Cross-review by Claude Code + Codex + CodeRabbit

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@ArangoGutierrez ArangoGutierrez force-pushed the feat/nfd-integration-wiring branch from e145131 to c76aaa0 Compare April 7, 2026 19:32
@mchmarny mchmarny added this to the v0.12 milestone Apr 7, 2026
@mchmarny mchmarny enabled auto-merge (squash) April 8, 2026 12:39
@mchmarny mchmarny merged commit 02002ca into NVIDIA:main Apr 8, 2026
15 of 19 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 8, 2026
PR NVIDIA#502 added a Phase 1 "hardware" subtype before the existing "smi"
subtype in GPU measurements. The snapshot validation action used
subtypes[0] to read gpu.model, which now hits "hardware" (no model
field) instead of "smi", causing GPU model: null on all H100 runners.

Fix: query by subtype name (select(.name == "smi")) instead of index.

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 8, 2026
PR NVIDIA#502 added a Phase 1 "hardware" subtype before the existing "smi"
subtype in GPU measurements. The snapshot validation action used
subtypes[0] to read gpu.model, which now hits "hardware" (no model
field) instead of "smi", causing GPU model: null on all H100 runners.

Fix: query by subtype name (select(.name == "smi")) instead of index.

Signed-off-by: Yuan Chen <[email protected]>
@ArangoGutierrez ArangoGutierrez deleted the feat/nfd-integration-wiring branch April 8, 2026 20:02
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 28, 2026
Adds an nfd componentRefs entry to the 12 production GPU leaf overlays
(H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) flipping
overrides.topologyUpdater.enable to true. Each cluster running these
recipes now publishes per-node NodeResourceTopology (NRT) CRDs
describing NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity
— the fact base downstream NUMA-aware schedulers and recipe
auto-resolution flows can consume.

Overlay change: each leaf gains a fresh componentRefs entry with
name=nfd, type=Helm, and overrides.topologyUpdater.enable=true.
mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the
overrides onto the base nfd entry from recipes/overlays/base.yaml
without replacing inherited source/version/valuesFile.

Chart configuration (recipes/components/nfd/values.yaml):

* topologyUpdater.createCRDs: true ensures the noderesourcetopologies
  CRD is installed by the chart on overlays that flip enable=true.
  Managed K8s control planes (EKS / AKS / GKE / OKE / LKE) do not
  preinstall it. The chart guards the CRD template on
  (enable && createCRDs), so this is a no-op when TU is off.
* topologyUpdater.kubeletStateDir: "" disables the broad
  /var/lib/kubelet hostPath mount. TU only needs the pod-resources
  gRPC socket, which is mounted via a dedicated hostPath; the empty
  state-dir stops the chart from exposing kubelet state files (TLS
  material, pod manifests, checkpoint data) into the TU container.

Scheduling (recipes/registry.yaml): nfd.nodeScheduling.{system,
accelerated}.tolerationPaths now include topologyUpdater.tolerations
so the bundler injects --accelerated-node-toleration values into TU
pods. Without this, on every targeted GPU cluster (which carry
nvidia.com/gpu=present:NoSchedule), the TU DaemonSet would have been
unschedulable. nodeSelector deliberately not added — TU runs on all
nodes (same rationale as worker), and topology data on system nodes
is needed for cross-NUMA scheduling decisions.

Kind-chain overlays (h100-kind-*, kind-*) are intentionally excluded:
KWOK and kind clusters lack a real kubelet pod-resources gRPC socket,
so TU would CrashLoopBackOff. The KubeletPodResources feature gate
has been Beta-default since Kubernetes 1.15 and reached GA in 1.28
(KEP-606); AICR's affected leaves require K8s >= 1.30, so the
prerequisite is satisfied in practice.

Coverage:

* New chainsaw step assert-bundle-topology-nfd in CUJ1-training
  asserts the rendered NFD bundle contains topologyUpdater.enable=
  true, createCRDs=true, the GPU-taint toleration, master.enable=
  true, gc.enable=true, and enableNodeFeatureApi=true.
* New TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/
  metadata_test.go covers all 12 GPU platform+intent overlays
  (expected ON) and both kind-chain leaves (expected OFF).
  Type-assertion failures on ON cases promote to t.Fatal so a
  malformed override produces a loud regression rather than a
  silent false-negative.
* docs/user/component-catalog.md gains the missing NFD row and a
  Topology Updater section documenting which recipes run TU and
  the kubelet pod-resources prerequisite.

Closes the local NFD adoption initiative (Tracks A and B previously
shipped via PRs NVIDIA#482, NVIDIA#494, NVIDIA#495, NVIDIA#502, NVIDIA#511, NVIDIA#518, NVIDIA#688). Track C
(recipe auto-resolution from NRT data) and scheduler-integration
work remain open for future PRs.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants