Skip to content

feat: add HardwareDetector interface and measurement keys for NFD integration#482

Merged
mchmarny merged 6 commits into
NVIDIA:mainfrom
ArangoGutierrez:feat/nfd-infrastructure
Apr 3, 2026
Merged

feat: add HardwareDetector interface and measurement keys for NFD integration#482
mchmarny merged 6 commits into
NVIDIA:mainfrom
ArangoGutierrez:feat/nfd-infrastructure

Conversation

@ArangoGutierrez

@ArangoGutierrez ArangoGutierrez commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Task 0 of NFD Snapshot Enrichment (Track A).

Adds shared infrastructure for day-0 GPU hardware detection:

  • Measurement keys: gpu-present, driver-loaded, detection-source (with KeyGPU* prefix convention)
  • Timeout constant: NFDDetectionTimeout (5s)
  • HardwareDetector interface and HardwareInfo type (including DetectionSource field)

Note

The NFD Go dependency (sigs.k8s.io/node-feature-discovery) and PCI/NFD constants
(vendor ID, device classes, source names) are deferred to Task 1 when code imports
NFD packages. go mod tidy removes unused imports, and golangci-lint flags
unused constants, so both are best added alongside their first consumer.

Testing

  • go build ./... passes
  • All existing tests pass with -race
  • Error-path test case validates nil-safety of HardwareDetector mock

@github-actions

github-actions Bot commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Welcome to AICR, @ArangoGutierrez! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@ArangoGutierrez ArangoGutierrez force-pushed the feat/nfd-infrastructure branch from b420e88 to 3770f88 Compare April 2, 2026 18:05
@github-actions github-actions Bot added size/XL and removed size/M labels Apr 2, 2026
@ArangoGutierrez ArangoGutierrez force-pushed the feat/nfd-infrastructure branch 2 times, most recently from 4e9a4a7 to 96d81fc Compare April 2, 2026 18:30
@ArangoGutierrez ArangoGutierrez marked this pull request as ready for review April 2, 2026 18:39
@ArangoGutierrez ArangoGutierrez requested a review from a team as a code owner April 2, 2026 18:39
…Detector interface

Add NFD API types dependency (sigs.k8s.io/node-feature-discovery/api/[email protected])
for GPU hardware detection without driver dependency.

New measurement keys: KeyGPUPresent, KeyDriverLoaded, KeyDetectionSource
in pkg/measurement/types.go for NFD-based hardware detection results.

New timeout: NFDDetectionTimeout (5s) in pkg/defaults/timeouts.go for
local sysfs/procfs operations (PCI enumeration and kernel module listing).

New interface: HardwareDetector with HardwareInfo type in
pkg/collector/gpu/hardware.go for abstracting GPU hardware detection.
PCI and NFD constants deferred to Task 1 when they are consumed.

New tests: TestHardwareDetectorInterface validates interface contract
with mock implementation. NFDDetectionTimeout added to TestTimeoutConstants.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
mchmarny

This comment was marked as resolved.

mchmarny

This comment was marked as resolved.

The blank import of sigs.k8s.io/node-feature-discovery/api/nfd/v1alpha1
executes K8s scheme registration init() functions in every binary.
The NFD dependency will be added in Task 1 when actual code uses it.

Addresses reviewer feedback: critical finding.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Aligns the struct with the KeyGPUDetectionSource measurement key
so implementations don't need to pass detection source out-of-band.

Addresses reviewer feedback: important finding.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Adds 'detection failure' test case exercising wantErr=true with nil info.
Adds early return after error check to prevent nil-pointer dereference.
Adds DetectionSource field assertions to existing test cases.

Addresses reviewer feedback: important finding.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
KeyDriverLoaded -> KeyGPUDriverLoaded
KeyDetectionSource -> KeyGPUDetectionSource

String values unchanged. Aligns with existing KeyGPUDriver, KeyGPUModel,
KeyGPUCount naming convention.

Addresses reviewer feedback: minor finding.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@github-actions github-actions Bot added size/M and removed size/XL labels Apr 3, 2026
@ArangoGutierrez ArangoGutierrez requested a review from mchmarny April 3, 2026 12:09

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the fixes

/lgtm

@mchmarny mchmarny merged commit ba20188 into NVIDIA:main Apr 3, 2026
15 checks passed
@ArangoGutierrez ArangoGutierrez deleted the feat/nfd-infrastructure branch April 8, 2026 20:02
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 28, 2026
Adds an nfd componentRefs entry to the 12 production GPU leaf overlays
(H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) flipping
overrides.topologyUpdater.enable to true. Each cluster running these
recipes now publishes per-node NodeResourceTopology (NRT) CRDs
describing NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity
— the fact base downstream NUMA-aware schedulers and recipe
auto-resolution flows can consume.

Overlay change: each leaf gains a fresh componentRefs entry with
name=nfd, type=Helm, and overrides.topologyUpdater.enable=true.
mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the
overrides onto the base nfd entry from recipes/overlays/base.yaml
without replacing inherited source/version/valuesFile.

Chart configuration (recipes/components/nfd/values.yaml):

* topologyUpdater.createCRDs: true ensures the noderesourcetopologies
  CRD is installed by the chart on overlays that flip enable=true.
  Managed K8s control planes (EKS / AKS / GKE / OKE / LKE) do not
  preinstall it. The chart guards the CRD template on
  (enable && createCRDs), so this is a no-op when TU is off.
* topologyUpdater.kubeletStateDir: "" disables the broad
  /var/lib/kubelet hostPath mount. TU only needs the pod-resources
  gRPC socket, which is mounted via a dedicated hostPath; the empty
  state-dir stops the chart from exposing kubelet state files (TLS
  material, pod manifests, checkpoint data) into the TU container.

Scheduling (recipes/registry.yaml): nfd.nodeScheduling.{system,
accelerated}.tolerationPaths now include topologyUpdater.tolerations
so the bundler injects --accelerated-node-toleration values into TU
pods. Without this, on every targeted GPU cluster (which carry
nvidia.com/gpu=present:NoSchedule), the TU DaemonSet would have been
unschedulable. nodeSelector deliberately not added — TU runs on all
nodes (same rationale as worker), and topology data on system nodes
is needed for cross-NUMA scheduling decisions.

Kind-chain overlays (h100-kind-*, kind-*) are intentionally excluded:
KWOK and kind clusters lack a real kubelet pod-resources gRPC socket,
so TU would CrashLoopBackOff. The KubeletPodResources feature gate
has been Beta-default since Kubernetes 1.15 and reached GA in 1.28
(KEP-606); AICR's affected leaves require K8s >= 1.30, so the
prerequisite is satisfied in practice.

Coverage:

* New chainsaw step assert-bundle-topology-nfd in CUJ1-training
  asserts the rendered NFD bundle contains topologyUpdater.enable=
  true, createCRDs=true, the GPU-taint toleration, master.enable=
  true, gc.enable=true, and enableNodeFeatureApi=true.
* New TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/
  metadata_test.go covers all 12 GPU platform+intent overlays
  (expected ON) and both kind-chain leaves (expected OFF).
  Type-assertion failures on ON cases promote to t.Fatal so a
  malformed override produces a loud regression rather than a
  silent false-negative.
* docs/user/component-catalog.md gains the missing NFD row and a
  Topology Updater section documenting which recipes run TU and
  the kubelet pod-resources prerequisite.

Closes the local NFD adoption initiative (Tracks A and B previously
shipped via PRs NVIDIA#482, NVIDIA#494, NVIDIA#495, NVIDIA#502, NVIDIA#511, NVIDIA#518, NVIDIA#688). Track C
(recipe auto-resolution from NRT data) and scheduler-integration
work remain open for future PRs.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants