Skip to content

chore: update skyhook to latest version#336

Merged
lockwobr merged 2 commits into
mainfrom
chore/skyhook-update-0.14.0
Mar 11, 2026
Merged

chore: update skyhook to latest version#336
lockwobr merged 2 commits into
mainfrom
chore/skyhook-update-0.14.0

Conversation

@lockwobr

Copy link
Copy Markdown
Contributor

Summary

Update skyhook to latest, should fix the issue seen with webhooks.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@github-actions

github-actions Bot commented Mar 10, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 73.3%
Threshold 70%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-73.3%25-green)

No Go source files changed in this PR.

@lockwobr lockwobr merged commit 318c075 into main Mar 11, 2026
38 of 60 checks passed
@lockwobr lockwobr deleted the chore/skyhook-update-0.14.0 branch March 11, 2026 00:16
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 29, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kubeflow-trainer         2.1.0   -> 2.2.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.

Companion changes:

- validators/performance/trainer_lifecycle.go: bump the hardcoded
  Kubeflow Trainer fallback archive from v2.1.0 to v2.2.0 so the
  no-CRD install path matches the chart pin. Verified v2.2.0
  archive layout (manifests/overlays/manager kustomize) and CRD
  identity (trainjobs.trainer.kubeflow.org / v1alpha1) are
  unchanged from v2.1.0.

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md:
  resync with the v0.26.2 content already declared in
  vendor/modules.txt. The prior dep-update commit on main
  (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't
  refresh those two doc files, so `go mod vendor` in CI produces
  a diff against the committed vendor and the `tests/Test` gate
  fails. Running `go mod vendor` here picks up the consistent
  v0.26.2 docs.

Refs: NVIDIA#698
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 29, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md:
  resync with the v0.26.2 content already declared in
  vendor/modules.txt. The prior dep-update commit on main
  (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't
  refresh those two doc files, so `go mod vendor` in CI produces
  a diff against the committed vendor and the `tests/Test` gate
  fails. Running `go mod vendor` here picks up the consistent
  v0.26.2 docs.

Refs: NVIDIA#698
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 29, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md:
  resync with the v0.26.2 content already declared in
  vendor/modules.txt. The prior dep-update commit on main
  (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't
  refresh those two doc files, so `go mod vendor` in CI produces
  a diff against the committed vendor and the `tests/Test` gate
  fails. Running `go mod vendor` here picks up the consistent
  v0.26.2 docs.

Refs: NVIDIA#698
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 29, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it. Verified locally.

- vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md:
  resync with the v0.26.2 content already declared in
  vendor/modules.txt. The prior dep-update commit on main
  (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't
  refresh those two doc files, so `go mod vendor` in CI produces
  a diff against the committed vendor and the `tests/Test` gate
  fails. Running `go mod vendor` here picks up the consistent
  v0.26.2 docs.

Refs: NVIDIA#698
Closes: NVIDIA#716
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 29, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it. Verified locally.

- vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md:
  resync with the v0.26.2 content already declared in
  vendor/modules.txt. The prior dep-update commit on main
  (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't
  refresh those two doc files, so `go mod vendor` in CI produces
  a diff against the committed vendor and the `tests/Test` gate
  fails. Running `go mod vendor` here picks up the consistent
  v0.26.2 docs.

Refs: NVIDIA#698
Closes: NVIDIA#716
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 29, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it. Verified locally.

- vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md:
  resync with the v0.26.2 content already declared in
  vendor/modules.txt. The prior dep-update commit on main
  (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't
  refresh those two doc files, so `go mod vendor` in CI produces
  a diff against the committed vendor and the `tests/Test` gate
  fails. Running `go mod vendor` here picks up the consistent
  v0.26.2 docs.

Refs: NVIDIA#698
Closes: NVIDIA#716
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 29, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it. Verified locally.

- vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md:
  resync with the v0.26.2 content already declared in
  vendor/modules.txt. The prior dep-update commit on main
  (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't
  refresh those two doc files, so `go mod vendor` in CI produces
  a diff against the committed vendor and the `tests/Test` gate
  fails. Running `go mod vendor` here picks up the consistent
  v0.26.2 docs.

Refs: NVIDIA#698
Closes: NVIDIA#716
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it. Verified locally.

- vendor/github.com/go-openapi/strfmt/{README,CONTRIBUTORS}.md:
  resync with the v0.26.2 content already declared in
  vendor/modules.txt. The prior dep-update commit on main
  (0c939ce) bumped strfmt to v0.26.2 in modules.txt but didn't
  refresh those two doc files, so `go mod vendor` in CI produces
  a diff against the committed vendor and the `tests/Test` gate
  fails. Running `go mod vendor` here picks up the consistent
  v0.26.2 docs.

Refs: NVIDIA#698
Closes: NVIDIA#716
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it.

- recipes/registry.yaml: also fix the gke-nccl-tcpxo registry
  entry to use the established manifest-only Helm pattern (empty
  `helm.defaultRepository` plus `defaultNamespace: kube-system`)
  instead of the unparsed `manifest:` block. The `manifest:` field
  is not on the ComponentConfig struct, so its `defaultNamespace`
  was silently ignored. Pre-NVIDIA#706 this was inert (manifest-only
  components were installed via raw `kubectl apply`, which routed
  via inline `metadata.namespace`). After NVIDIA#706 wraps every
  component as a local Helm chart, the generated install.sh emits
  `--namespace  --create-namespace` (empty) and Helm fails. This
  blocks every post-NVIDIA#706 GKE-COS H100 KWOK training run, including
  this PR's CI which auto-promotes the GKE-COS Tier-2 matrix when
  registry.yaml or base.yaml change. Switches to the same pattern
  used by `nodewright-customizations`. Verified bundled install.sh
  now contains `--namespace kube-system`. Supersedes NVIDIA#718.

Refs: NVIDIA#698
Closes: NVIDIA#716, NVIDIA#718
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it.

Refs: NVIDIA#698
Closes: NVIDIA#716
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Phase 1 of the version refresh tracked in NVIDIA#698: minor and patch
bumps across registry defaults and overlay/mixin pins. No values
schema changes required.

  aws-ebs-csi-driver       2.55.0  -> 2.59.0
  cert-manager             v1.17.2 -> v1.20.2
  kube-prometheus-stack    82.8.0  -> 84.4.0
  kueue                    0.17.0  -> 0.17.1
  nodewright-operator      v0.14.0 -> v0.15.1
  nvsentinel               v1.1.0  -> v1.3.0

Excluded from this PR:
- kgateway / kgateway-crds (v2.0.0 -> v2.2.3) — v2.2.3 silently
  drops the `inferenceExtension.enabled` value (no longer in the
  chart's values.yaml). v2.0.0 renders inf_ext_rbac.yaml
  (ClusterRole granting access to inference.networking.x-k8s.io
  inferencemodels/inferencepools) plus KGW_ENABLE_INFER_EXT env;
  v2.2.3 renders neither. AICR uses kgateway specifically for the
  CNCF AI Conformance "Advanced Ingress for AI/ML Inference"
  requirement, so a silent feature regression here would break
  inference bundles. Migration to v2.2.3 needs a values + RBAC
  rework — deferred.
- aws-efa (v0.5.3 -> v0.5.26) — 23 minors require values cleanup
  including a real security-posture change (chart now defaults to
  privileged: true for EFA hardware access, conflicting with our
  hardened allowPrivilegeEscalation: false override). Deferred to
  a follow-up so the change can get proper EKS/security review.
- kai-scheduler (v0.13.0 -> v0.14.1) — KAI-Scheduler was transferred
  from NVIDIA/ to kai-scheduler/ org and chart publishing moved
  with it. New OCI namespace is `ghcr.io/kai-scheduler/kai-scheduler`
  (the old `ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0).
  This is an OCI-source migration plus a bump — coupled changes
  worth their own follow-up PR rather than mixing into pure pin
  bumps here.
- kubeflow-trainer (2.1.0 -> 2.2.0) — chart bump is coupled to a
  Go change in validators/performance/trainer_lifecycle.go (the
  hardcoded fallback archive URL needs to track the chart pin).
  The validator + chart bumps belong together in a follow-up PR
  to keep this PR pure config / no Go changes.

Companion changes:

- examples/recipes/{kind,eks-training,aks-training,eks-gb200-
  ubuntu-training-with-validation}.yaml: refresh the cert-manager,
  nodewright-operator, kube-prometheus-stack, and nvsentinel pins
  to match the bumped registry defaults. Matches the convention
  from prior bump PRs (NVIDIA#283, NVIDIA#336, NVIDIA#450).

- examples/recipes/aks-training.yaml: also remove an orphaned
  `manifestFiles:` reference to
  components/nvsentinel/manifests/allow-intra-namespace.yaml that
  has been broken since NVIDIA#415 (the workaround source file was
  deleted in NVIDIA#309 when nvsentinel was bumped past v0.7.0, but the
  AKS example was added later by copying from another template
  and kept the now-stale reference). Bundling
  examples/recipes/aks-training.yaml currently fails with
  "file does not exist"; this fix restores it.

Refs: NVIDIA#698
Closes: NVIDIA#716
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Jun 9, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants