Skip to content

fix(recipes): use NFD chart version 0.18.3 without v prefix#688

Merged
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/nfd-chart-version
Apr 27, 2026
Merged

fix(recipes): use NFD chart version 0.18.3 without v prefix#688
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/nfd-chart-version

Conversation

@yuanchen8911

Copy link
Copy Markdown
Contributor

Summary

Drop the v prefix from the nfd componentRef version in recipes/overlays/base.yaml so it matches the chart version published in the upstream Helm index (0.18.3, with appVersion: v0.18.3).

Motivation/Context

CI was logging:

Installing nfd (node-feature-discovery)...
Release "nfd" does not exist. Installing it now.
level=WARN msg="unable to find exact version requested; falling back to closest available version" chart=node-feature-discovery requested=v0.18.3 selected=0.18.3

Verified against the published chart index (https://kubernetes-sigs.github.io/node-feature-discovery/charts/index.yaml):

appVersion: v0.18.3
version: 0.18.3

--version to Helm should be the chart version (0.18.3), not the appVersion (v0.18.3). Other charts in the same recipe that publish without a v prefix (kube-prometheus-stack, k8s-ephemeral-storage-metrics, nvidia-dra-driver-gpu) already follow this convention.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

Components Affected

  • Recipes / overlays

Implementation Notes

One-line change: `version: v0.18.3` -> `version: 0.18.3`.

Testing

This is a recipe-config change, not a Go-code change.

  • `yamllint recipes/overlays/base.yaml` -> clean
  • `go test ./pkg/recipe/...` -> ok
  • `go test ./pkg/bundler/deployer/helm/...` -> ok

Skipping full `make qualify` per CLAUDE.local.md infra-only rule: tests, e2e, and Go lint cannot regress from a single value change in a recipe overlay; the recipe loader and the helm bundler tests both exercise this file and pass.

Risk Assessment

Low. Helm was already installing the same chart via fuzzy fallback. This change only removes the warning by asking for the exact published version.

Checklist

  • Commit cryptographically signed
  • Rebased onto `origin/main`
  • No new lint or test failures in affected packages

@coderabbitai

coderabbitai Bot commented Apr 25, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 140627eb-cc9f-41f6-b314-c42e12092817

📥 Commits

Reviewing files that changed from the base of the PR and between 1e559aa and ce9fdcc.

📒 Files selected for processing (1)
  • recipes/overlays/base.yaml

📝 Walkthrough

Walkthrough

A version string was changed in recipes/overlays/base.yaml for the nfd Helm chart: the version value was updated from v0.18.3 to 0.18.3, removing the v prefix. No other fields, component definitions, or configurations were modified.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: removing the 'v' prefix from the NFD chart version in the recipes overlay, which matches the actual single-line modification.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the motivation (Helm warning), the upstream chart version mismatch, testing performed, and risk assessment.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

The node-feature-discovery Helm chart is published as version 0.18.3
(no v prefix); v0.18.3 is the appVersion. With version: v0.18.3 the
recipe asked for a tag that does not exist in the chart index, and Helm
fell back to the closest match:

  level=WARN msg="unable to find exact version requested; falling back
  to closest available version" chart=node-feature-discovery
  requested=v0.18.3 selected=0.18.3

Use 0.18.3 to match the published chart version exactly and silence
the warning.
@yuanchen8911 yuanchen8911 force-pushed the fix/nfd-chart-version branch from 6b830db to ce9fdcc Compare April 27, 2026 15:18
@mchmarny mchmarny enabled auto-merge (squash) April 27, 2026 15:36
@mchmarny mchmarny merged commit 8d0168e into NVIDIA:main Apr 27, 2026
85 checks passed
lockwobr pushed a commit that referenced this pull request Apr 28, 2026
ArangoGutierrez added a commit to ArangoGutierrez/aicr that referenced this pull request Apr 28, 2026
Adds an nfd componentRefs entry to the 12 production GPU leaf overlays
(H100, GB200, RTX Pro 6000 on EKS / AKS / GKE / OKE / LKE) flipping
overrides.topologyUpdater.enable to true. Each cluster running these
recipes now publishes per-node NodeResourceTopology (NRT) CRDs
describing NUMA zones, GPU-to-NUMA affinity, and NIC-to-NUMA affinity
— the fact base downstream NUMA-aware schedulers and recipe
auto-resolution flows can consume.

Overlay change: each leaf gains a fresh componentRefs entry with
name=nfd, type=Helm, and overrides.topologyUpdater.enable=true.
mergeComponentRef (pkg/recipe/metadata.go:532) deep-merges the
overrides onto the base nfd entry from recipes/overlays/base.yaml
without replacing inherited source/version/valuesFile.

Chart configuration (recipes/components/nfd/values.yaml):

* topologyUpdater.createCRDs: true ensures the noderesourcetopologies
  CRD is installed by the chart on overlays that flip enable=true.
  Managed K8s control planes (EKS / AKS / GKE / OKE / LKE) do not
  preinstall it. The chart guards the CRD template on
  (enable && createCRDs), so this is a no-op when TU is off.
* topologyUpdater.kubeletStateDir: "" disables the broad
  /var/lib/kubelet hostPath mount. TU only needs the pod-resources
  gRPC socket, which is mounted via a dedicated hostPath; the empty
  state-dir stops the chart from exposing kubelet state files (TLS
  material, pod manifests, checkpoint data) into the TU container.

Scheduling (recipes/registry.yaml): nfd.nodeScheduling.{system,
accelerated}.tolerationPaths now include topologyUpdater.tolerations
so the bundler injects --accelerated-node-toleration values into TU
pods. Without this, on every targeted GPU cluster (which carry
nvidia.com/gpu=present:NoSchedule), the TU DaemonSet would have been
unschedulable. nodeSelector deliberately not added — TU runs on all
nodes (same rationale as worker), and topology data on system nodes
is needed for cross-NUMA scheduling decisions.

Kind-chain overlays (h100-kind-*, kind-*) are intentionally excluded:
KWOK and kind clusters lack a real kubelet pod-resources gRPC socket,
so TU would CrashLoopBackOff. The KubeletPodResources feature gate
has been Beta-default since Kubernetes 1.15 and reached GA in 1.28
(KEP-606); AICR's affected leaves require K8s >= 1.30, so the
prerequisite is satisfied in practice.

Coverage:

* New chainsaw step assert-bundle-topology-nfd in CUJ1-training
  asserts the rendered NFD bundle contains topologyUpdater.enable=
  true, createCRDs=true, the GPU-taint toleration, master.enable=
  true, gc.enable=true, and enableNodeFeatureApi=true.
* New TestNFDTopologyUpdater_OverlayCoverage in pkg/recipe/
  metadata_test.go covers all 12 GPU platform+intent overlays
  (expected ON) and both kind-chain leaves (expected OFF).
  Type-assertion failures on ON cases promote to t.Fatal so a
  malformed override produces a loud regression rather than a
  silent false-negative.
* docs/user/component-catalog.md gains the missing NFD row and a
  Topology Updater section documenting which recipes run TU and
  the kubelet pod-resources prerequisite.

Closes the local NFD adoption initiative (Tracks A and B previously
shipped via PRs NVIDIA#482, NVIDIA#494, NVIDIA#495, NVIDIA#502, NVIDIA#511, NVIDIA#518, NVIDIA#688). Track C
(recipe auto-resolution from NRT data) and scheduler-integration
work remain open for future PRs.

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants