fix: rename prometheus component to kube-prometheus-stack #3
Merged
mchmarny merged 1 commit on Feb 2, 2026
56d3f26 to ba6c04c
Member
Looks like more work will be required to make this work in CI. @dims @lalitadithya, is there anything we can replicate here from NVS?
dims added a commit to dims/cloud-native-stack that referenced this pull request on Jan 31, 2026

Fork PRs have restricted GITHUB_TOKEN permissions that prevent posting comments directly. This change uses the workflow_run pattern:

1. Main workflow uploads coverage data as artifact (read-only safe)
2. Separate workflow_run triggered workflow posts comment (write perms)

This is the recommended secure pattern per GitHub Security Lab:
https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/

Fixes: NVIDIA/aicr#3
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Davanum Srinivas <[email protected]>
dims added a commit to dims/aicr that referenced this pull request on Jan 31, 2026

Fork PRs have restricted GITHUB_TOKEN permissions that prevent posting comments directly. This change uses the workflow_run pattern:

1. Main workflow uploads coverage data as artifact (read-only safe)
2. Separate workflow_run triggered workflow posts comment (write perms)

This is the recommended secure pattern per GitHub Security Lab:
https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/

Fixes: NVIDIA/aicr#3
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Davanum Srinivas <[email protected]>
0b5b091 to 43c038e
43c038e to 6c6c199
Contributor
Coverage Report ✅
Coverage unchanged by this PR.
dims approved these changes on Feb 2, 2026
fix: rename prometheus component to kube-prometheus-stack

Align component name with Helm chart name to ensure values are correctly passed to the sub-chart in umbrella chart deployments.

Changes:
- Rename component from 'prometheus' to 'kube-prometheus-stack' in registry
- Rename components/prometheus directory to components/kube-prometheus-stack
- Update base.yaml overlay to use new component name and values path
- Update monitoring-hpa.yaml dependency reference
- Keep 'prometheus' in valueOverrideKeys for backwards compatibility

Co-Authored-By: Claude Opus 4.5 <[email protected]>
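The backwards-compatibility item in this commit can be illustrated with a small sketch. It assumes that `valueOverrideKeys` behaves as a list of aliases that map a `--set <key>:<path>=<value>` override onto the canonical component name. The registry shape, the names `resolve_component` and `parse_override`, and the example value `prom` are hypothetical and are not taken from the repository.

```python
# Hypothetical registry shape: canonical component name -> accepted override keys.
REGISTRY = {
    "kube-prometheus-stack": {"valueOverrideKeys": ["prometheus"]},
}


def resolve_component(key, registry=REGISTRY):
    """Return the canonical component name for an override key, or None."""
    for name, entry in registry.items():
        if key == name or key in entry.get("valueOverrideKeys", []):
            return name
    return None


def parse_override(arg, registry=REGISTRY):
    """Split 'component:path=value' into (canonical component, path, value)."""
    target, _, value = arg.partition("=")
    key, _, path = target.partition(":")
    component = resolve_component(key, registry)
    if component is None:
        raise ValueError(f"unknown component key: {key!r}")
    return component, path, value
```

Under these assumptions, `parse_override("prometheus:fullnameOverride=prom")` resolves to the renamed component, so existing command lines would keep working after the rename.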
a6df5f4 to 27afca5
dims referenced this pull request in dims/aicr on Feb 20, 2026

Add three new validation steps to the H100 inference test:

- Inference Gateway (#6): verify GatewayClass accepted and Gateway programmed with inference extension CRDs present
- Accelerator & AI Service Metrics (#4/#5): verify DCGM Exporter metrics, Prometheus scraping, and custom metrics API availability
- Secure Accelerator Access (#3): verify GPU access is DRA-mediated (no hostPath, no device plugin), with proper container security

Also adds diagnostics for gateway, metrics, and DRA state on failure.

Signed-off-by: Davanum Srinivas <[email protected]>
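As a rough illustration of the "Secure Accelerator Access" step, the sketch below checks a pod manifest (already parsed into a dict) for the three conditions the commit message names. How the actual test performs these checks is not shown in this thread, so the function name `dra_access_violations` and the specific heuristics (a `hostPath` volume, an `nvidia.com/gpu` limit as the sign of device-plugin allocation, and a non-empty `spec.resourceClaims` as the sign of DRA) are assumptions.

```python
def dra_access_violations(pod):
    """List reasons a pod's GPU access does not look DRA-mediated."""
    spec = pod.get("spec", {})
    problems = []

    # hostPath volumes would bypass DRA by exposing node paths directly.
    if any("hostPath" in volume for volume in spec.get("volumes", [])):
        problems.append("pod mounts a hostPath volume")

    # An nvidia.com/gpu limit means the GPU comes from the device plugin.
    for container in spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if "nvidia.com/gpu" in limits:
            problems.append(
                f"container {container.get('name')} requests nvidia.com/gpu"
            )

    # DRA-mediated pods reference at least one resource claim.
    if not spec.get("resourceClaims"):
        problems.append("pod declares no resourceClaims")

    return problems
```

An empty list means none of the three heuristics fired. The container security settings mentioned in the commit message are not covered here.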
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request on Apr 30, 2026

Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:
  kai-scheduler     v0.13.0 -> v0.14.1
  kubeflow-trainer  2.1.0 -> 2.2.0

kai-scheduler - chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer - chart bump and validator fallback URL update (NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag.

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  Pulled.
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  Pulled.
  $ aicr recipe --service eks --accelerator h100 --intent training \
      --os ubuntu --platform kubeflow -o recipe.yaml
  $ aicr bundle -r recipe.yaml -o /tmp/bundle
  ... succeeds for both kai-scheduler and kubeflow-trainer components.

Refs: NVIDIA#698
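The coupling between the chart pin and the validator's fallback URL could be guarded by a small consistency check. The sketch below is only an illustration: the function name `fallback_matches_pin` is hypothetical, and because the commit message writes the tag as `<version>`, the exact tag spelling (with or without a leading `v`) is an assumption that the code sidesteps by ignoring the prefix.

```python
import re

# Tag segment of a GitHub archive URL: .../archive/refs/tags/<tag>.tar.gz
_TAG_RE = re.compile(r"/archive/refs/tags/(?P<tag>[^/]+)\.tar\.gz$")


def fallback_matches_pin(chart_version, fallback_url):
    """True when the archive tag in fallback_url names the pinned chart version.

    A leading 'v' on either side is ignored; that normalisation is an
    assumption, since the source only shows the tag as '<version>'.
    """
    match = _TAG_RE.search(fallback_url)
    if match is None:
        return False
    return match.group("tag").lstrip("v") == chart_version.lstrip("v")
```

A check like this in CI would fail as soon as the pin in the registry moved without the validator URL following it, which is the drift the commit message warns about.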
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request on Apr 30, 2026

Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:
  kai-scheduler     v0.13.0 -> v0.14.1
  kubeflow-trainer  2.1.0 -> 2.2.0

kai-scheduler - chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer - chart bump, validator fallback URL update, and demo migration to the new RuntimePatches API (NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag.

v2.2.0 ships a breaking API change to TrainJob: `podTemplateOverrides` is replaced by `runtimePatches` (kubeflow/trainer#3309). The CRD still admits the old field name for compat, but the controller no longer applies it: pods come out with no override fields, and on AICR's tainted GPU nodes the `tolerations: [{operator: Exists}]` shorthand the demo previously used silently no-ops, leaving pods Pending. The `pytorch-mnist` demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the new shape:

    spec:
      runtimePatches:
        - manager: aicr.nvidia.com/demo
          trainingRuntimeSpec:
            template:
              spec:
                replicatedJobs:
                  - name: node
                    template:
                      spec:
                        template:
                          spec:
                            nodeSelector: {nodeGroup: gpu-worker}
                            tolerations:
                              - {key: dedicated, operator: Equal, value: worker-workload, effect: NoSchedule}
                              - {key: dedicated, operator: Equal, value: worker-workload, effect: NoExecute}

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: TrainJob admitted, pod scheduled to the GPU node with the expected tolerations + nodeSelector, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline).

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/...
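Because the old field is still admitted by the CRD but silently ignored, a manifest can look valid and still leave pods Pending. A simple lint over parsed TrainJob manifests could surface this before deployment. The sketch below is hypothetical (the name `stale_override_fields` and the warning text are illustrative) and only inspects the top-level `spec` key named in the commit message.

```python
def stale_override_fields(trainjob):
    """Return warnings for TrainJob spec fields that v2.2.0 no longer applies."""
    spec = trainjob.get("spec", {})
    warnings = []
    if "podTemplateOverrides" in spec:
        warnings.append(
            "spec.podTemplateOverrides is no longer applied; "
            "use spec.runtimePatches"
        )
    return warnings
```

Running such a check over the demo manifests would have flagged the pre-migration TrainJob instead of relying on a Pending pod to reveal the problem.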
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request on Apr 30, 2026

Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:
  kai-scheduler     v0.13.0 -> v0.14.1
  kubeflow-trainer  2.1.0 -> 2.2.0

kai-scheduler - chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer - chart bump, validator fallback URL update, demo migration to RuntimePatches, and ClusterTrainingRuntime alignment (NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag.

v2.2.0 ships two breaking API changes that touch AICR:

1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309). The CRD still admits the old field for compat but the v2.2 controller no longer applies it. The pytorch-mnist demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the `runtimePatches` shape with `manager: aicr.nvidia.com/demo` and explicit per-cluster scheduling (the EKS demo carries the AICR-standard `dedicated=worker-workload` tolerations + NoExecute effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule` and `nvidia.com/gpu=present:NoSchedule` to match the rest of the GKE flow).
2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239). Upstream removed the field from the Torch policy because it now infers parallelism from the container's `nvidia.com/gpu` limit. `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI test fixtures stay as-is. AICR's `torch-distributed` ClusterTrainingRuntime is updated from `mlPolicy.torch: { numProcPerNode: auto }` to `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime.

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: demo TrainJob admitted, pod scheduled with the migrated runtimePatches, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline). 2-replica Deployment with `schedulerName: kai-scheduler` + DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also scheduled cleanly with `priorityClassName: train` (each replica got its own H100 via DRA).

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...
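The second breaking change amounts to a small manifest transformation, sketched below on a parsed runtime dict. This is an illustration rather than the change made in the commit: the function name `migrate_torch_policy` is hypothetical, and placing `mlPolicy` directly under `spec` is an assumption about the ClusterTrainingRuntime layout.

```python
import copy


def migrate_torch_policy(runtime):
    """Return a copy of a runtime dict without mlPolicy.torch.numProcPerNode.

    Only the Torch policy is touched; mlPolicy.mpi.numProcPerNode is left
    in place, as the commit message says the MPI field is unaffected.
    """
    migrated = copy.deepcopy(runtime)
    ml_policy = migrated.get("spec", {}).get("mlPolicy", {})
    torch_policy = ml_policy.get("torch")
    if isinstance(torch_policy, dict):
        torch_policy.pop("numProcPerNode", None)
    return migrated
```

Applied to a runtime with `mlPolicy.torch: { numProcPerNode: auto }`, this yields `mlPolicy.torch: {}`, the shape the commit message describes for the `torch-distributed` runtime. The input is deep-copied so the original manifest stays available for comparison.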
Contributor
This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.
Summary
- Rename component from `prometheus` to `kube-prometheus-stack` to match the Helm chart name

Problem
The component was named `prometheus` but the Helm chart is `kube-prometheus-stack`. When generating umbrella charts, the Chart.yaml dependency uses the actual chart name (`kube-prometheus-stack`), but values were keyed under `prometheus:`. Because of this mismatch, Helm values (like `fullnameOverride`) were not passed to the sub-chart.

Changes
- `pkg/recipe/data/registry.yaml`: Rename component from `prometheus` to `kube-prometheus-stack`
- `pkg/recipe/data/components/prometheus/` → `pkg/recipe/data/components/kube-prometheus-stack/`
- `pkg/recipe/data/overlays/base.yaml`: Update component name and valuesFile path
- `pkg/recipe/data/overlays/monitoring-hpa.yaml`: Update dependencyRef
- Keep `prometheus` in `valueOverrideKeys` for backwards compatibility with `--set prometheus:key=value`

Test plan
- `helm install` and verify kube-prometheus-stack values are applied correctly
- `--set prometheus:key=value` still works for backwards compatibility

🤖 Generated with Claude Code
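The mismatch described under Problem follows from how Helm scopes parent-chart values: a sub-chart only receives the block keyed by its dependency name (or its alias, if one is set). The sketch below models that rule on plain dicts to show why values keyed under `prometheus:` never arrived. The function name `values_reaching_subcharts` and the example value `prom` are hypothetical, and this is a simplified model, not Helm's implementation.

```python
def values_reaching_subcharts(dependencies, umbrella_values):
    """Map each dependency to the values block that would be handed to it.

    Parent values are scoped to a sub-chart by the dependency's alias if set,
    otherwise by its chart name; other top-level keys stay with the parent.
    """
    scoped = {}
    for dep in dependencies:
        key = dep.get("alias") or dep["name"]
        scoped[dep["name"]] = umbrella_values.get(key, {})
    return scoped


deps = [{"name": "kube-prometheus-stack"}]

# Before the rename: values keyed under the component name, not the chart name.
before = {"prometheus": {"fullnameOverride": "prom"}}
# After the rename: values keyed under the chart name.
after = {"kube-prometheus-stack": {"fullnameOverride": "prom"}}
```

With `before`, the sub-chart receives an empty block and falls back to its defaults; with `after`, `fullnameOverride` reaches it. Renaming the component fixes the key without needing a dependency alias.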