fix: rename prometheus component to kube-prometheus-stack by yuanchen8911 · Pull Request #3 · NVIDIA/aicr

yuanchen8911 · 2026-01-31T00:05:44Z

Summary

Rename component from prometheus to kube-prometheus-stack to match the Helm chart name
Ensures values are correctly passed to the sub-chart in umbrella chart deployments

Problem

The component was named prometheus but the Helm chart is kube-prometheus-stack. When generating umbrella charts, the Chart.yaml dependency uses the actual chart name (kube-prometheus-stack), but values were keyed under prometheus:. This mismatch caused Helm values (like fullnameOverride) to not be passed to the sub-chart.
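The mismatch can be shown with a minimal sketch, assuming `Chart.yaml` and `values.yaml` have already been parsed into plain dicts. Helm passes values to a sub-chart under the dependency's `alias` if one is set, otherwise under its `name`. The helper name `unmatched_value_keys`, the placeholder version, and the sample `fullnameOverride` value are illustrative and not taken from the aicr code base.

```python
def unmatched_value_keys(dependencies, values):
    """Return the sub-chart keys that have no matching top-level values entry."""
    missing = []
    for dep in dependencies:
        # Helm looks up sub-chart values under the alias if set, else the chart name.
        key = dep.get("alias") or dep["name"]
        if key not in values:
            missing.append(key)
    return missing


# Illustrative data: the version and override value are placeholders.
dependencies = [{"name": "kube-prometheus-stack", "version": "0.0.0"}]
before = {"prometheus": {"fullnameOverride": "prometheus"}}
after = {"kube-prometheus-stack": {"fullnameOverride": "prometheus"}}

print(unmatched_value_keys(dependencies, before))  # ['kube-prometheus-stack']
print(unmatched_value_keys(dependencies, after))   # []
```

With the old key, the sub-chart never sees `fullnameOverride`; after the rename the key and the dependency name line up.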

Changes

pkg/recipe/data/registry.yaml: Rename component from prometheus to kube-prometheus-stack
pkg/recipe/data/components/prometheus/ → pkg/recipe/data/components/kube-prometheus-stack/
pkg/recipe/data/overlays/base.yaml: Update component name and valuesFile path
pkg/recipe/data/overlays/monitoring-hpa.yaml: Update dependencyRef
Keep prometheus in valueOverrideKeys for backwards compatibility with --set prometheus:key=value
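The backwards-compatibility item can be read as an alias lookup. The following is a rough sketch of that idea only: the real implementation lives in the Go code base and may differ. The registry shape and the helpers `resolve_component` and `parse_set_flag` are hypothetical; only the component name, the `prometheus` override key, and the `component:key=value` form come from the description above.

```python
# Hypothetical in-memory registry entry (the real one is registry.yaml).
REGISTRY = [
    {"name": "kube-prometheus-stack", "valueOverrideKeys": ["prometheus"]},
]


def resolve_component(key, registry=REGISTRY):
    """Map a component name or one of its override keys to the component name."""
    for component in registry:
        if key == component["name"] or key in component.get("valueOverrideKeys", []):
            return component["name"]
    raise KeyError(f"unknown component key: {key!r}")


def parse_set_flag(flag, registry=REGISTRY):
    """Split 'component:path=value' into (component name, path, value)."""
    target, eq, value = flag.partition("=")
    component_key, colon, path = target.partition(":")
    if not eq or not colon or not path:
        raise ValueError(f"expected component:path=value, got {flag!r}")
    return resolve_component(component_key, registry), path, value


# Both the legacy key and the new name resolve to the same component.
print(parse_set_flag("prometheus:fullnameOverride=mon"))
print(parse_set_flag("kube-prometheus-stack:fullnameOverride=mon"))
```

Keeping the old key as an alias means existing `--set prometheus:...` invocations keep working while the generated values are keyed under the chart name.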

Test plan

Generate a bundle and verify Chart.yaml dependency name matches values.yaml key
Deploy with helm install and verify kube-prometheus-stack values are applied correctly
Verify --set prometheus:key=value still works for backwards compatibility

🤖 Generated with Claude Code

mchmarny · 2026-01-31T00:34:06Z

Looks like there will be more work required on this to make it work in CI. @dims @lalitadithya anything we can replicate here from NVS?

Fork PRs have restricted GITHUB_TOKEN permissions that prevent posting comments directly. This change uses the workflow_run pattern:

1. Main workflow uploads coverage data as artifact (read-only safe)
2. Separate workflow_run triggered workflow posts comment (write perms)

This is the recommended secure pattern per GitHub Security Lab:
https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/

Fixes: NVIDIA/aicr#3

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Davanum Srinivas <[email protected]>

github-actions · 2026-01-31T21:23:13Z

Coverage Report ✅

| Metric | Value |
| --- | --- |
| Coverage | 73.8% |
| Threshold | 70% |
| Status | Pass |

Coverage Badge

![Coverage](https://img.shields.io/badge/coverage-73.8%25-green)

Coverage unchanged by this PR.

fix: rename prometheus component to kube-prometheus-stack

Align component name with Helm chart name to ensure values are correctly passed to the sub-chart in umbrella chart deployments.

Changes:
- Rename component from 'prometheus' to 'kube-prometheus-stack' in registry
- Rename components/prometheus directory to components/kube-prometheus-stack
- Update base.yaml overlay to use new component name and values path
- Update monitoring-hpa.yaml dependency reference
- Keep 'prometheus' in valueOverrideKeys for backwards compatibility

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Add three new validation steps to the H100 inference test:

- Inference Gateway (#6): verify GatewayClass accepted and Gateway programmed with inference extension CRDs present
- Accelerator & AI Service Metrics (#4/#5): verify DCGM Exporter metrics, Prometheus scraping, and custom metrics API availability
- Secure Accelerator Access (#3): verify GPU access is DRA-mediated (no hostPath, no device plugin), with proper container security

Also adds diagnostics for gateway, metrics, and DRA state on failure.

Signed-off-by: Davanum Srinivas <[email protected]>

Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

    kai-scheduler     v0.13.0 -> v0.14.1
    kubeflow-trainer  2.1.0 -> 2.2.0

kai-scheduler: chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream.

v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer: chart bump and validator fallback URL update (NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one.

v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag.

Verified locally:

    $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
    Pulled.
    $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
    Pulled.
    $ aicr recipe --service eks --accelerator h100 --intent training \
        --os ubuntu --platform kubeflow -o recipe.yaml
    $ aicr bundle -r recipe.yaml -o /tmp/bundle
    ... succeeds for both kai-scheduler and kubeflow-trainer components.

Refs: NVIDIA#698
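The coupling between the chart pin and the validator's fallback URL could be guarded with a small consistency check. This is a sketch under stated assumptions, not the validator's actual code (which is Go): the helper names are hypothetical, and it assumes the upstream release tag is the chart version with a `v` prefix, as the `v2.2.0` / `2.2.0` naming in the message suggests.

```python
import re

# URL template taken from the commit message; {tag} stands in for <version>.
ARCHIVE_URL = "https://github.com/kubeflow/trainer/archive/refs/tags/{tag}.tar.gz"


def fallback_archive_url(chart_version, tag_prefix="v"):
    """Build the fallback archive URL for a chart version (assumed 'v' tag prefix)."""
    return ARCHIVE_URL.format(tag=f"{tag_prefix}{chart_version}")


def url_matches_pin(url, chart_version):
    """True if the tag embedded in the archive URL equals the pinned chart version."""
    match = re.search(r"/refs/tags/v?([^/]+)\.tar\.gz$", url)
    return bool(match) and match.group(1) == chart_version


pinned = "2.2.0"
print(url_matches_pin(fallback_archive_url("2.2.0"), pinned))  # True
print(url_matches_pin(fallback_archive_url("2.1.0"), pinned))  # False
```

A check of this kind in a unit test would fail as soon as the pin moves without the URL, which is the drift the message warns about.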

Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

    kai-scheduler     v0.13.0 -> v0.14.1
    kubeflow-trainer  2.1.0 -> 2.2.0

kai-scheduler: chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream.

v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer: chart bump, validator fallback URL update, and demo migration to the new RuntimePatches API (NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one.

v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag.

v2.2.0 ships a breaking API change to TrainJob: `podTemplateOverrides` is replaced by `runtimePatches` (kubeflow/trainer#3309). The CRD still admits the old field name for compat, but the controller no longer applies it: pods come out with no override fields, and on AICR's tainted GPU nodes the `tolerations: [{operator: Exists}]` shorthand the demo previously used silently no-ops, leaving pods Pending.

The `pytorch-mnist` demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the new shape:

    spec:
      runtimePatches:
        - manager: aicr.nvidia.com/demo
          trainingRuntimeSpec:
            template:
              spec:
                replicatedJobs:
                  - name: node
                    template:
                      spec:
                        template:
                          spec:
                            nodeSelector: {nodeGroup: gpu-worker}
                            tolerations:
                              - {key: dedicated, operator: Equal, value: worker-workload, effect: NoSchedule}
                              - {key: dedicated, operator: Equal, value: worker-workload, effect: NoExecute}

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: TrainJob admitted, pod scheduled to the GPU node with the expected tolerations + nodeSelector, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline).

Verified locally:

    $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
    $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
    $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/...

Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

    kai-scheduler     v0.13.0 -> v0.14.1
    kubeflow-trainer  2.1.0 -> 2.2.0

kai-scheduler: chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream.

v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer: chart bump, validator fallback URL update, demo migration to RuntimePatches, and ClusterTrainingRuntime alignment (NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one.

v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag.

v2.2.0 ships two breaking API changes that touch AICR:

1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309). The CRD still admits the old field for compat but the v2.2 controller no longer applies it. The pytorch-mnist demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the `runtimePatches` shape with `manager: aicr.nvidia.com/demo` and explicit per-cluster scheduling (the EKS demo carries the AICR-standard `dedicated=worker-workload` tolerations + NoExecute effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule` and `nvidia.com/gpu=present:NoSchedule` to match the rest of the GKE flow).

2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239). Upstream removed the field from the Torch policy because it now infers parallelism from the container's `nvidia.com/gpu` limit. `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI test fixtures stay as-is. AICR's `torch-distributed` ClusterTrainingRuntime is updated from `mlPolicy.torch: { numProcPerNode: auto }` to `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime.

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: demo TrainJob admitted, pod scheduled with the migrated runtimePatches, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline). 2-replica Deployment with `schedulerName: kai-scheduler` + DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also scheduled cleanly with `priorityClassName: train` (each replica got its own H100 via DRA).

Verified locally:

    $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
    $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
    $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...
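Because the CRD still admits the old field and the controller silently ignores it, a manifest lint is one way to catch leftovers before they reach a cluster. The sketch below is illustrative only: `deprecated_trainer_fields` is a hypothetical helper, manifests are assumed to be already parsed into dicts, and placing `mlPolicy` directly under `spec` is an assumption based on the field paths quoted in the message.

```python
def deprecated_trainer_fields(manifest):
    """Return findings for fields the commit message says v2.2.0 no longer honors."""
    findings = []
    kind = manifest.get("kind")
    spec = manifest.get("spec") or {}

    # TrainJob: podTemplateOverrides is replaced by runtimePatches.
    if kind == "TrainJob" and "podTemplateOverrides" in spec:
        findings.append("spec.podTemplateOverrides: use spec.runtimePatches")

    # Runtimes: the Torch policy no longer takes numProcPerNode (MPI is unaffected).
    if kind in ("TrainingRuntime", "ClusterTrainingRuntime"):
        torch = (spec.get("mlPolicy") or {}).get("torch") or {}
        if "numProcPerNode" in torch:
            findings.append("spec.mlPolicy.torch.numProcPerNode: removed upstream")

    return findings


old_job = {"kind": "TrainJob", "spec": {"podTemplateOverrides": []}}
new_runtime = {"kind": "ClusterTrainingRuntime", "spec": {"mlPolicy": {"torch": {}}}}

print(deprecated_trainer_fields(old_job))
print(deprecated_trainer_fields(new_runtime))  # []
```

The check is deliberately narrow: it flags only the two fields named above and leaves `mlPolicy.mpi.numProcPerNode` alone.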

github-actions · 2026-05-04T06:59:17Z

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

yuanchen8911 requested a review from mchmarny January 31, 2026 00:11

yuanchen8911 force-pushed the fix/kube-prometheus-stack-naming branch from 56d3f26 to ba6c04c Compare January 31, 2026 00:23

dims mentioned this pull request Jan 31, 2026

fix: use workflow_run for PR coverage comments on fork PRs #5

Merged

25 tasks

dims force-pushed the fix/kube-prometheus-stack-naming branch from 0b5b091 to 43c038e Compare January 31, 2026 20:56

dims mentioned this pull request Jan 31, 2026

fix: add actions:read permission for artifact download #6

Merged

25 tasks

dims force-pushed the fix/kube-prometheus-stack-naming branch from 43c038e to 6c6c199 Compare January 31, 2026 21:18

dims approved these changes Feb 2, 2026

yuanchen8911 requested review from a team as code owners February 2, 2026 16:40

yuanchen8911 force-pushed the fix/kube-prometheus-stack-naming branch from a6df5f4 to 27afca5 Compare February 2, 2026 16:48

mchmarny merged commit fc68f03 into NVIDIA:main Feb 2, 2026
3 checks passed

dims mentioned this pull request Feb 20, 2026

chore: improve consistency across GPU CI workflows #160

Merged

6 tasks

This was referenced Feb 20, 2026

feat(ci): add CNCF AI conformance validations to inference workflow #162

Merged

feat(validator): add Go-based CNCF AI conformance checks #180

Merged

This was referenced Mar 9, 2026

feat(validation): container-per-validator execution engine #290

Merged

fix(validator): add retry for ai-service-metrics Prometheus query #393

Merged

yuanchen8911 mentioned this pull request Mar 19, 2026

docs: add ADR-004 for hydrated recipe query command #435

Merged

3 tasks

This was referenced Apr 10, 2026

feat(bundler): add --dynamic flag for install-time values (#515) #527

Merged

GPU CI: optimize build/deploy performance, deduplicate workflows, and simplify behavioral tests #541

Closed

yuanchen8911 mentioned this pull request Apr 20, 2026

enhance(validator): add targeted post-deployment GPU readiness checks #611

Merged

25 tasks

This was referenced Apr 24, 2026

bundle --attest has no headless-local OIDC path (device-code or --identity-token) #682

Closed

bundle --attest: no OIDC token caching across runs (every invocation triggers new Rekor entry) #685

Closed

github-actions Bot locked as resolved and limited conversation to collaborators May 4, 2026

mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:fix/kube-prometheus-stack-naming
