feat: add HardwareDetector interface and measurement keys for NFD integration#482
Merged
Merged
Conversation
Contributor
|
Welcome to AICR, @ArangoGutierrez! Thanks for your first pull request. Before review, please ensure:
A maintainer will review this soon. |
b420e88 to
3770f88
Compare
4e9a4a7 to
96d81fc
Compare
…Detector interface Add NFD API types dependency (sigs.k8s.io/node-feature-discovery/api/[email protected]) for GPU hardware detection without driver dependency. New measurement keys: KeyGPUPresent, KeyDriverLoaded, KeyDetectionSource in pkg/measurement/types.go for NFD-based hardware detection results. New timeout: NFDDetectionTimeout (5s) in pkg/defaults/timeouts.go for local sysfs/procfs operations (PCI enumeration and kernel module listing). New interface: HardwareDetector with HardwareInfo type in pkg/collector/gpu/hardware.go for abstracting GPU hardware detection. PCI and NFD constants deferred to Task 1 when they are consumed. New tests: TestHardwareDetectorInterface validates interface contract with mock implementation. NFDDetectionTimeout added to TestTimeoutConstants. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
96d81fc to
be04d3e
Compare
The blank import of sigs.k8s.io/node-feature-discovery/api/nfd/v1alpha1 executes K8s scheme registration init() functions in every binary. The NFD dependency will be added in Task 1 when actual code uses it. Addresses reviewer feedback: critical finding. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Aligns the struct with the KeyGPUDetectionSource measurement key so implementations don't need to pass detection source out-of-band. Addresses reviewer feedback: important finding. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Adds 'detection failure' test case exercising wantErr=true with nil info. Adds early return after error check to prevent nil-pointer dereference. Adds DetectionSource field assertions to existing test cases. Addresses reviewer feedback: important finding. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
KeyDriverLoaded -> KeyGPUDriverLoaded KeyDetectionSource -> KeyGPUDetectionSource String values unchanged. Aligns with existing KeyGPUDriver, KeyGPUModel, KeyGPUCount naming convention. Addresses reviewer feedback: minor finding. Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
This was referenced Apr 6, 2026
13 tasks
ArangoGutierrez
added a commit
to ArangoGutierrez/aicr
that referenced
this pull request
Apr 28, 2026
Adds an nfd componentRefs entry to the 12 production GPU leaf overlays
(H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) flipping
overrides.topologyUpdater.enable to true. Each cluster running these
recipes now publishes per-node NodeResourceTopology (NRT) CRDs
describing NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity
— the fact base downstream NUMA-aware schedulers and recipe
auto-resolution flows can consume.
Overlay change: each leaf gains a fresh componentRefs entry with
name=nfd, type=Helm, and overrides.topologyUpdater.enable=true.
mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the
overrides onto the base nfd entry from recipes/overlays/base.yaml
without replacing inherited source/version/valuesFile.
Chart configuration (recipes/components/nfd/values.yaml):
* topologyUpdater.createCRDs: true ensures the noderesourcetopologies
CRD is installed by the chart on overlays that flip enable=true.
Managed K8s control planes (EKS / AKS / GKE / OKE / LKE) do not
preinstall it. The chart guards the CRD template on
(enable && createCRDs), so this is a no-op when TU is off.
* topologyUpdater.kubeletStateDir: "" disables the broad
/var/lib/kubelet hostPath mount. TU only needs the pod-resources
gRPC socket, which is mounted via a dedicated hostPath; the empty
state-dir stops the chart from exposing kubelet state files (TLS
material, pod manifests, checkpoint data) into the TU container.
Scheduling (recipes/registry.yaml): nfd.nodeScheduling.{system,
accelerated}.tolerationPaths now include topologyUpdater.tolerations
so the bundler injects --accelerated-node-toleration values into TU
pods. Without this, on every targeted GPU cluster (which carry
nvidia.com/gpu=present:NoSchedule), the TU DaemonSet would have been
unschedulable. nodeSelector deliberately not added — TU runs on all
nodes (same rationale as worker), and topology data on system nodes
is needed for cross-NUMA scheduling decisions.
Kind-chain overlays (h100-kind-*, kind-*) are intentionally excluded:
KWOK and kind clusters lack a real kubelet pod-resources gRPC socket,
so TU would CrashLoopBackOff. The KubeletPodResources feature gate
has been Beta-default since Kubernetes 1.15 and reached GA in 1.28
(KEP-606); AICR's affected leaves require K8s >= 1.30, so the
prerequisite is satisfied in practice.
Coverage:
* New chainsaw step assert-bundle-topology-nfd in CUJ1-training
asserts the rendered NFD bundle contains topologyUpdater.enable=
true, createCRDs=true, the GPU-taint toleration, master.enable=
true, gc.enable=true, and enableNodeFeatureApi=true.
* New TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/
metadata_test.go covers all 12 GPU platform+intent overlays
(expected ON) and both kind-chain leaves (expected OFF).
Type-assertion failures on ON cases promote to t.Fatal so a
malformed override produces a loud regression rather than a
silent false-negative.
* docs/user/component-catalog.md gains the missing NFD row and a
Topology Updater section documenting which recipes run TU and
the kubelet pod-resources prerequisite.
Closes the local NFD adoption initiative (Tracks A and B previously
shipped via PRs NVIDIA#482, NVIDIA#494, NVIDIA#495, NVIDIA#502, NVIDIA#511, NVIDIA#518, NVIDIA#688). Track C
(recipe auto-resolution from NRT data) and scheduler-integration
work remain open for future PRs.
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Task 0 of NFD Snapshot Enrichment (Track A).
Adds shared infrastructure for day-0 GPU hardware detection:
gpu-present,driver-loaded,detection-source(withKeyGPU*prefix convention)NFDDetectionTimeout(5s)HardwareDetectorinterface andHardwareInfotype (includingDetectionSourcefield)Note
The NFD Go dependency (
sigs.k8s.io/node-feature-discovery) and PCI/NFD constants(vendor ID, device classes, source names) are deferred to Task 1 when code imports
NFD packages.
go mod tidyremoves unused imports, andgolangci-lintflagsunused constants, so both are best added alongside their first consumer.
Testing
go build ./...passes-race