fix: rename prometheus component to kube-prometheus-stack #3

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:fix/kube-prometheus-stack-naming on Feb 2, 2026

Conversation

@yuanchen8911

Contributor

Summary

  • Rename component from prometheus to kube-prometheus-stack to match the Helm chart name
  • Ensures values are correctly passed to the sub-chart in umbrella chart deployments

Problem

The component was named prometheus, but the Helm chart is kube-prometheus-stack. When generating umbrella charts, the Chart.yaml dependency uses the actual chart name (kube-prometheus-stack), while values were keyed under prometheus:. Helm passes a sub-chart only the values under the key matching its dependency name (or alias), so values such as fullnameOverride never reached the sub-chart.
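
The mismatch can be reproduced with a minimal sketch of Helm's value scoping. This is an illustration, not Helm's or AICR's actual code: the helper name `subchart_values` and the value `monitoring` are hypothetical, and the parent values are assumed to be already parsed into dictionaries.

```python
def subchart_values(parent_values, dependency):
    """Return the values a sub-chart would see under Helm's scoping rule."""
    # Helm scopes parent values by the dependency's alias if set, else its name.
    key = dependency.get("alias") or dependency["name"]
    scoped = dict(parent_values.get(key, {}))
    # Top-level `global` values are shared with every sub-chart.
    if "global" in parent_values:
        scoped["global"] = parent_values["global"]
    return scoped


dependency = {"name": "kube-prometheus-stack"}

# Before the fix: values keyed under the old component name.
before = {"prometheus": {"fullnameOverride": "monitoring"}}
# After the fix: values keyed under the chart name.
after = {"kube-prometheus-stack": {"fullnameOverride": "monitoring"}}

print(subchart_values(before, dependency))  # {}
print(subchart_values(after, dependency))   # {'fullnameOverride': 'monitoring'}
```

With the old key, the sub-chart receives an empty mapping and falls back to its own defaults, which matches the symptom described above.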

Changes

  • pkg/recipe/data/registry.yaml: Rename component from prometheus to kube-prometheus-stack
  • pkg/recipe/data/components/prometheus/ → pkg/recipe/data/components/kube-prometheus-stack/
  • pkg/recipe/data/overlays/base.yaml: Update component name and valuesFile path
  • pkg/recipe/data/overlays/monitoring-hpa.yaml: Update dependencyRef
  • Keep prometheus in valueOverrideKeys for backwards compatibility with --set prometheus:key=value
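
The backwards-compatibility item can be pictured as an alias lookup. The following is a rough sketch under assumptions: the `REGISTRY` layout and the `resolve_override` helper are hypothetical, and only the `valueOverrideKeys` name and the `prometheus:key=value` form come from this PR. How AICR actually parses overrides is not shown here.

```python
# Hypothetical registry excerpt: each component lists the keys accepted in overrides.
REGISTRY = {
    "kube-prometheus-stack": {
        "valueOverrideKeys": ["kube-prometheus-stack", "prometheus"],
    },
}


def resolve_override(flag, registry=REGISTRY):
    """Split '<key>:<path>=<value>' and map the key to its canonical component."""
    target, sep, value = flag.partition("=")
    key, colon, path = target.partition(":")
    if not sep or not colon or not path:
        raise ValueError(f"expected <component>:<path>=<value>, got {flag!r}")
    for component, spec in registry.items():
        # Accept either the canonical name or any listed legacy alias.
        if key == component or key in spec.get("valueOverrideKeys", []):
            return component, path, value
    raise KeyError(f"unknown component key: {key!r}")


print(resolve_override("prometheus:fullnameOverride=monitoring"))
# ('kube-prometheus-stack', 'fullnameOverride', 'monitoring')
```

Both the old and the new key resolve to the same component, so existing invocations keep working while generated values land under the chart name.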

Test plan

  • Generate a bundle and verify Chart.yaml dependency name matches values.yaml key
  • Deploy with helm install and verify kube-prometheus-stack values are applied correctly
  • Verify --set prometheus:key=value still works for backwards compatibility
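
The first check in the test plan could be automated along these lines. This is a sketch only: `check_bundle` is a hypothetical helper, and it assumes Chart.yaml and values.yaml from the generated bundle have already been loaded into dictionaries by a YAML parser of your choice.

```python
def check_bundle(chart, values, reserved=("global",)):
    """Compare Chart.yaml dependency names with top-level values.yaml keys."""
    expected = {
        dep.get("alias") or dep["name"] for dep in chart.get("dependencies", [])
    }
    actual = set(values) - set(reserved)
    return {
        # Dependencies that have no values block keyed under their name.
        "missing_values": sorted(expected - actual),
        # Top-level keys that match no dependency name or alias.
        "orphaned_values": sorted(actual - expected),
    }


chart = {"dependencies": [{"name": "kube-prometheus-stack"}]}

print(check_bundle(chart, {"prometheus": {"fullnameOverride": "monitoring"}}))
# {'missing_values': ['kube-prometheus-stack'], 'orphaned_values': ['prometheus']}
print(check_bundle(chart, {"kube-prometheus-stack": {"fullnameOverride": "monitoring"}}))
# {'missing_values': [], 'orphaned_values': []}
```

An orphaned key is not always an error, since a parent chart may consume top-level values in its own templates, so the result is best read as a prompt for review.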

🤖 Generated with Claude Code

@yuanchen8911 yuanchen8911 requested a review from mchmarny January 31, 2026 00:11
@yuanchen8911 yuanchen8911 force-pushed the fix/kube-prometheus-stack-naming branch from 56d3f26 to ba6c04c Compare January 31, 2026 00:23
@mchmarny

Member

Looks like more work will be required to get this passing in CI. @dims @lalitadithya, is there anything we can replicate here from NVS?

dims added a commit to dims/cloud-native-stack that referenced this pull request Jan 31, 2026
Fork PRs have restricted GITHUB_TOKEN permissions that prevent posting
comments directly. This change uses the workflow_run pattern:

1. Main workflow uploads coverage data as artifact (read-only safe)
2. Separate workflow_run triggered workflow posts comment (write perms)

This is the recommended secure pattern per GitHub Security Lab:
https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/

Fixes: NVIDIA/aicr#3

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Davanum Srinivas <[email protected]>
dims added a commit to dims/aicr that referenced this pull request Jan 31, 2026
@dims dims force-pushed the fix/kube-prometheus-stack-naming branch from 0b5b091 to 43c038e Compare January 31, 2026 20:56
@dims dims force-pushed the fix/kube-prometheus-stack-naming branch from 43c038e to 6c6c199 Compare January 31, 2026 21:18
@github-actions

github-actions Bot commented Jan 31, 2026

Contributor

Coverage Report ✅

| Metric | Value |
|-----------|-------|
| Coverage | 73.8% |
| Threshold | 70% |
| Status | Pass |

Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-73.8%25-green)

Coverage unchanged by this PR.

@yuanchen8911 yuanchen8911 requested review from a team as code owners February 2, 2026 16:40
fix: rename prometheus component to kube-prometheus-stack

Align component name with Helm chart name to ensure values are correctly
passed to the sub-chart in umbrella chart deployments.

Changes:
- Rename component from 'prometheus' to 'kube-prometheus-stack' in registry
- Rename components/prometheus directory to components/kube-prometheus-stack
- Update base.yaml overlay to use new component name and values path
- Update monitoring-hpa.yaml dependency reference
- Keep 'prometheus' in valueOverrideKeys for backwards compatibility

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@yuanchen8911 yuanchen8911 force-pushed the fix/kube-prometheus-stack-naming branch from a6df5f4 to 27afca5 Compare February 2, 2026 16:48
@mchmarny mchmarny merged commit fc68f03 into NVIDIA:main Feb 2, 2026
3 checks passed
dims referenced this pull request in dims/aicr Feb 20, 2026
Add three new validation steps to the H100 inference test:

- Inference Gateway (#6): verify GatewayClass accepted and Gateway
  programmed with inference extension CRDs present
- Accelerator & AI Service Metrics (#4/#5): verify DCGM Exporter
  metrics, Prometheus scraping, and custom metrics API availability
- Secure Accelerator Access (#3): verify GPU access is DRA-mediated
  (no hostPath, no device plugin), with proper container security

Also adds diagnostics for gateway, metrics, and DRA state on failure.

Signed-off-by: Davanum Srinivas <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump and validator fallback URL update
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  Pulled.
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  Pulled.
  $ aicr recipe --service eks --accelerator h100 --intent training \
      --os ubuntu --platform kubeflow -o recipe.yaml
  $ aicr bundle -r recipe.yaml -o /tmp/bundle
  ... succeeds for both kai-scheduler and kubeflow-trainer components.

Refs: NVIDIA#698
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump, validator fallback URL update, and
demo migration to the new RuntimePatches API
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

v2.2.0 ships a breaking API change to TrainJob: `podTemplateOverrides`
is replaced by `runtimePatches` (kubeflow/trainer#3309). The CRD still
admits the old field name for compat, but the controller no longer
applies it — pods come out with no override fields, and on AICR's
tainted GPU nodes the `tolerations: [{operator: Exists}]` shorthand
the demo previously used silently no-ops, leaving pods Pending.

The `pytorch-mnist` demo TrainJob in `demos/cuj1-eks.md` and
`demos/cuj1-gke.md` is migrated to the new shape:

  spec:
    runtimePatches:
      - manager: aicr.nvidia.com/demo
        trainingRuntimeSpec:
          template:
            spec:
              replicatedJobs:
                - name: node
                  template:
                    spec:
                      template:
                        spec:
                          nodeSelector: {nodeGroup: gpu-worker}
                          tolerations:
                            - {key: dedicated, operator: Equal,
                               value: worker-workload, effect: NoSchedule}
                            - {key: dedicated, operator: Equal,
                               value: worker-workload, effect: NoExecute}

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade:
TrainJob admitted, pod scheduled to the GPU node with the expected
tolerations + nodeSelector, training completed in 2m39s with
accuracy=0.7413 (matches pre-upgrade baseline).

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/...
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump, validator fallback URL update, demo
migration to RuntimePatches, and ClusterTrainingRuntime alignment
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

v2.2.0 ships two breaking API changes that touch AICR:

  1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309).
     The CRD still admits the old field for compat but the v2.2
     controller no longer applies it. The pytorch-mnist demo TrainJob
     in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to
     the `runtimePatches` shape with `manager: aicr.nvidia.com/demo`
     and explicit per-cluster scheduling (the EKS demo carries the
     AICR-standard `dedicated=worker-workload` tolerations + NoExecute
     effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule`
     and `nvidia.com/gpu=present:NoSchedule` to match the rest of the
     GKE flow).

  2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239).
     Upstream removed the field from the Torch policy because it now
     infers parallelism from the container's `nvidia.com/gpu` limit.
     `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI
     test fixtures stay as-is. AICR's `torch-distributed`
     ClusterTrainingRuntime is updated from
     `mlPolicy.torch: { numProcPerNode: auto }` to
     `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime.

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade:
demo TrainJob admitted, pod scheduled with the migrated runtimePatches,
training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade
baseline). 2-replica Deployment with `schedulerName: kai-scheduler` +
DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also
scheduled cleanly with `priorityClassName: train` (each replica got
its own H100 via DRA).

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...
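
The coupling between the chart pin and the validator's fallback archive URL, described in the commit message above, lends itself to a small consistency check. This is a minimal sketch, not code from the repository: `fallback_matches_pin` and `normalize` are hypothetical helpers, and whether the release tag carries a leading `v` is an assumption that the normalization step sidesteps.

```python
import re

# URL template quoted in the commit message; <version> is the release tag.
ARCHIVE_PATTERN = re.compile(
    r"^https://github\.com/kubeflow/trainer/archive/refs/tags/(?P<tag>[^/]+)\.tar\.gz$"
)


def normalize(version):
    """Drop one leading 'v' so '2.2.0' and 'v2.2.0' compare equal."""
    return version[1:] if version.startswith("v") else version


def fallback_matches_pin(chart_pin, fallback_url):
    """Return True when the archive tag in the URL matches the chart pin."""
    match = ARCHIVE_PATTERN.match(fallback_url)
    if match is None:
        raise ValueError(f"unrecognized fallback URL: {fallback_url!r}")
    return normalize(match.group("tag")) == normalize(chart_pin)


base = "https://github.com/kubeflow/trainer/archive/refs/tags/"
print(fallback_matches_pin("2.2.0", base + "v2.2.0.tar.gz"))  # True
print(fallback_matches_pin("2.2.0", base + "v2.1.0.tar.gz"))  # False
```

A check of this kind, run in a unit test, would flag the case where the pin moves but the fallback URL does not.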
@github-actions

github-actions Bot commented May 4, 2026

Contributor

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators May 4, 2026
3 participants