feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays #415

Merged: mchmarny merged 10 commits into NVIDIA:main from Jont828:add-aks-support on Mar 17, 2026

@Jont828 Jont828 commented Mar 16, 2026

Mirrors the existing EKS overlay structure with AKS-specific changes:

  • Storage: gp2 → managed-csi (Azure Disk CSI, built-in AKS addon)
  • Networking: No aws-efa equivalent needed (InfiniBand native on ND-series VMs)
  • GPU drivers: Disabled in GPU Operator (AKS pre-installs NVIDIA drivers/toolkit)
  • Skyhook: Customizations omitted (packages don't support aks yet; follows Kind pattern)
  • H100 only (GB200 not available on Azure)
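
As a hedged illustration of the storage change in the list above, a workload on AKS could request a volume from the built-in `managed-csi` StorageClass as sketched below. The claim name and size are hypothetical and are not taken from this PR.

```yaml
# Minimal sketch: a PersistentVolumeClaim bound to the AKS built-in
# managed-csi StorageClass (Azure Disk CSI). Name and size are illustrative.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-checkpoints   # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce            # Azure Disk volumes attach to a single node
  storageClassName: managed-csi
  resources:
    requests:
      storage: 100Gi           # illustrative size
```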

Summary

Motivation / Context

Fixes:
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@Jont828 Jont828 requested review from a team as code owners March 16, 2026 20:05
@mchmarny mchmarny added this to the M2 - KubeCon EU milestone Mar 16, 2026

@mchmarny mchmarny left a comment

Great start on AKS support! The overlay hierarchy mirrors other services and the Azure-specific adaptations (managed-csi storage, disabled driver/toolkit, no EFA equivalent) are well-reasoned. Left a few inline comments — nothing major, mostly verification questions and a minor cleanup suggestion. Looking forward to seeing this land!

Comment thread recipes/overlays/aks.yaml
Comment thread recipes/components/gpu-operator/values-aks-training.yaml Outdated
Comment thread recipes/overlays/h100-aks-training.yaml Outdated
Comment thread recipes/overlays/aks-inference.yaml

@yuanchen8911 yuanchen8911 left a comment

Thanks for the PR! Please take a look at the comment: #415 (comment)

@Jont828 Jont828 changed the title from "[WIP] feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays" to "feat: Add AKS (Azure Kubernetes Service) H100 recipe overlays" Mar 16, 2026
@yuanchen8911

Thanks for the PR! cc @mchmarny

| # | Severity | Issue | Suggested Fix |
|---|----------|-------|---------------|
| 1 | High | `helm-values` check not in validator catalog; will fail deployment validation at runtime | Remove `- helm-values` from deployment checks in `h100-aks-ubuntu-training.yaml:50`, `h100-aks-ubuntu-inference-dynamo.yaml:75`, and `examples/recipes/aks-training.yaml:171`. EKS equivalents already removed it (PR #388). |
| 2 | High | AKS inference inherits default GPU Operator values with driver/toolkit enabled; AKS pre-installs both | Move `valuesFile: components/gpu-operator/values-aks-training.yaml` (or a renamed `values-aks.yaml`) to the `aks.yaml` base `componentRefs` so both training and inference inherit it. Currently only `aks-training.yaml:38` sets it. |
| 3 | Low (optional) | `kubeflow-trainer` missing `dependencyRefs` in `h100-aks-ubuntu-training-kubeflow.yaml:45` | Add `dependencyRefs: [cert-manager, kube-prometheus-stack, gpu-operator]` to match EKS/GKE equivalents. |
| 4 | Low (optional) | No AKS conformance test in `conformance_test.go` | Add `h100-aks-ubuntu-inference-dynamo` and/or `h100-aks-ubuntu-training` test cases to `TestConformanceRecipeInvariants`. |
| 5 | Low (optional) | `examples/recipes/aks-training.yaml` drifted from generated output | Pins skyhook-operator v0.13.1 (current: v0.14.0); includes `helm-values` check (line 171). Regenerate with `aicr recipe`. |
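
Items 2 and 3 in the list above could be sketched roughly as follows. This is an illustrative fragment, not the merged file: the field names `componentRefs`, `valuesFile`, and `dependencyRefs` come from the review, while the `name` key and the omission of all surrounding overlay fields are assumptions.

```yaml
# Hypothetical fragment of recipes/overlays/aks.yaml (item 2).
# Setting valuesFile on the base overlay lets both the training and the
# inference overlays inherit the AKS-specific GPU Operator values.
componentRefs:
  - name: gpu-operator          # "name" key is an assumption
    valuesFile: components/gpu-operator/values-aks.yaml
---
# Hypothetical fragment of h100-aks-ubuntu-training-kubeflow.yaml (item 3).
componentRefs:
  - name: kubeflow-trainer      # "name" key is an assumption
    dependencyRefs: [cert-manager, kube-prometheus-stack, gpu-operator]
```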

Jont828 added 3 commits March 16, 2026 21:01
…erlay

AKS inference recipes silently inherited the base values.yaml (with
toolkit.enabled: true) because neither aks.yaml nor aks-inference.yaml
overrode the gpu-operator valuesFile. Since AKS pre-installs the NVIDIA
container toolkit, this caused conflicts on inference deployments.

Create values-aks.yaml with the shared toolkit disable and wire it into
the aks.yaml base overlay so all AKS intents inherit it. Slim down
values-aks-training.yaml to only training-specific settings.

Add docs/integrator/aks-gpu-setup.md documenting the --gpu-driver none
nodepool prerequisite to avoid driver conflicts with GPU Operator.

Signed-off-by: Jont828 <[email protected]>
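
A minimal sketch of what the shared values file from the commit above might contain, assuming it sits next to `values-aks-training.yaml` under `recipes/components/gpu-operator/`. Only the toolkit setting is shown because it is the one the commit message names; `toolkit.enabled` is a value of the upstream GPU Operator Helm chart. Any driver setting is left out here, since the commit message does not state it.

```yaml
# Sketch of recipes/components/gpu-operator/values-aks.yaml (path assumed).
# AKS pre-installs the NVIDIA container toolkit, so the GPU Operator
# should not deploy its own copy.
toolkit:
  enabled: false
```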
Remove non-existent network-operator-health check from aks.yaml
conformance validation, remove stale helm-values check references,
fix YAML comment indentation for yamllint compliance, add missing
AKS GPU Setup sidebar entry, and add kubeflow-trainer dependency
refs.

Signed-off-by: Jont828 <[email protected]>
Jont828 and others added 3 commits March 16, 2026 22:07
DRA (Dynamic Resource Allocation) graduated to GA in Kubernetes 1.34
with stable resource.k8s.io/v1 APIs. Bump the AKS overlay K8s version
constraint from >= 1.28 to >= 1.34, update integrator and user docs
with version requirements, feature gate timeline, CLI overrides, and
device-plugin vs DRA guidance. Add AKS to supported platforms in README.

Signed-off-by: Jont828 <[email protected]>
…overlay

Add K8s >= 1.34 constraint, nvidia-dra-driver-gpu component ref with
gpuResourcesEnabledOverride, and dra-support conformance check to the
h100-aks-ubuntu-inference-dynamo overlay.

Signed-off-by: Jont828 <[email protected]>
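
For context on the stable `resource.k8s.io/v1` API mentioned in the two commits above, a workload on Kubernetes 1.34 or later could request a GPU through DRA roughly as sketched here. The resource names, the container image, and the `gpu.nvidia.com` device class (assumed to be the class installed by the NVIDIA DRA driver) are illustrative assumptions and are not taken from this PR.

```yaml
# Sketch: request one GPU via DRA using the GA resource.k8s.io/v1 API.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                      # hypothetical name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com   # assumed device class
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-smoke-test                  # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
      command: ["nvidia-smi", "-L"]
      resources:
        claims:
          - name: gpu                   # refers to the pod-level claim below
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
```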
@yuanchen8911 yuanchen8911 self-requested a review March 17, 2026 15:54

@yuanchen8911 yuanchen8911 left a comment

/lgtm

@yuanchen8911

Need to rebase.

@mchmarny mchmarny enabled auto-merge (squash) March 17, 2026 15:56
@mchmarny mchmarny merged commit 87fd28f into NVIDIA:main Mar 17, 2026
60 checks passed
xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 29, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it. Verified locally.

- vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md:
  resync with the v0.26.2 content already declared in
  vendor/modules.txt. The prior dep-update commit on main
  (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't
  refresh those two doc files, so `go mod vendor` in CI produces
  a diff against the committed vendor and the `tests/Test` gate
  fails. Running `go mod vendor` here picks up the consistent
  v0.26.2 docs.

Refs: NVIDIA#698
Closes: NVIDIA#716
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
- recipes/registry.yaml: also fix the gke-nccl-tcpxo registry
  entry to use the established manifest-only Helm pattern (empty
  `helm.defaultRepository` plus `defaultNamespace: kube-system`)
  instead of the unparsed `manifest:` block. The `manifest:` field
  is not on the ComponentConfig struct, so its `defaultNamespace`
  was silently ignored. Pre-NVIDIA#706 this was inert (manifest-only
  components were installed via raw `kubectl apply`, which routed
  via inline `metadata.namespace`). After NVIDIA#706 wraps every
  component as a local Helm chart, the generated install.sh emits
  `--namespace  --create-namespace` (empty) and Helm fails. This
  blocks every post-NVIDIA#706 GKE-COS H100 KWOK training run, including
  this PR's CI which auto-promotes the GKE-COS Tier-2 matrix when
  registry.yaml or base.yaml change. Switches to the same pattern
  used by `nodewright-customizations`. Verified bundled install.sh
  now contains `--namespace kube-system`. Supersedes NVIDIA#718.

Refs: NVIDIA#698
Closes: NVIDIA#716, NVIDIA#718
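
The manifest-only Helm pattern described in the `recipes/registry.yaml` item above might look roughly like the entry below. Only `helm.defaultRepository` and `defaultNamespace: kube-system` are named in the commit message; the `name` key and the nesting of `defaultNamespace` under `helm` are assumptions made for illustration.

```yaml
# Hypothetical recipes/registry.yaml entry for a manifest-only component.
- name: gke-nccl-tcpxo                # "name" key is an assumption
  helm:
    defaultRepository: ""             # empty: no upstream chart repository
    defaultNamespace: kube-system     # nesting under "helm" is an assumption
```

With a namespace present on the parsed entry, the generated install.sh can emit `--namespace kube-system` instead of an empty `--namespace` argument, which is the failure the commit message describes.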
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.
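
The aws-efa conflict described above can be sketched as a small check. Kubernetes rejects a container security context that sets `privileged: true` together with `allowPrivilegeEscalation: false`, so layering the existing hardened override on the newer chart default would yield an invalid pod spec. The `merge` and `conflicts` helpers are hypothetical, and the shallow merge only approximates how Helm layers values for a flat `securityContext` map.

```python
def merge(defaults, overrides):
    """Shallow merge: override keys win, other defaults are kept."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def conflicts(security_context):
    """True when privileged is on while privilege escalation is explicitly off."""
    privileged = security_context.get("privileged", False)
    escalation = security_context.get("allowPrivilegeEscalation")
    return privileged is True and escalation is False


# Newer chart default, as described in the exclusion note above.
chart_default = {"privileged": True}
# The existing hardened override, as described in the exclusion note above.
hardened_override = {"allowPrivilegeEscalation": False}

effective = merge(chart_default, hardened_override)
print(effective, "conflict:", conflicts(effective))
```

A check like this could run against rendered values before a bump is accepted; it says nothing about which side of the conflict should give way.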

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,
  eks-gb200-ubuntu-training-with-validation}.yaml: refresh the
  cert-manager, nodewright-operator, kube-prometheus-stack, and
  nvsentinel pins to match the bumped registry defaults. Matches
  the convention from prior bump PRs (NVIDIA#283, NVIDIA#336,
  NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml,
  broken since NVIDIA#415. The workaround source file was deleted
  in NVIDIA#309 when nvsentinel was bumped past v0.7.0; the AKS
  example was added later by copying another template and kept
  the stale reference. Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it.

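A stale `manifestFiles:` reference like the one above could be caught with a small pre-bundle check. This is a minimal sketch that assumes the recipe has already been parsed into nested dicts and lists. The `componentRefs` key and both helper names are hypothetical and not taken from the AICR schema; only the `manifestFiles` key and the file path come from the message above.

```python
import os
import tempfile


def find_manifest_files(node):
    """Yield every string listed under any 'manifestFiles' key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "manifestFiles" and isinstance(value, list):
                for item in value:
                    if isinstance(item, str):
                        yield item
            else:
                yield from find_manifest_files(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_manifest_files(item)


def missing_manifest_files(recipe, root):
    """Return referenced paths that do not exist as files under root."""
    return [
        path
        for path in find_manifest_files(recipe)
        if not os.path.isfile(os.path.join(root, path))
    ]


STALE = "components/nvsentinel/manifests/allow-intra-namespace.yaml"

# Hypothetical recipe fragment; the real recipe layout may differ.
recipe = {"componentRefs": [{"name": "nvsentinel", "manifestFiles": [STALE]}]}

# An empty temporary directory stands in for a tree where the file was deleted.
with tempfile.TemporaryDirectory() as root:
    print(missing_manifest_files(recipe, root))
```

Run over every file in examples/recipes/, a check of this shape would flag the reference before bundling fails.
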
Refs: NVIDIA#698
Closes: NVIDIA#716