fix(uat): bind Prometheus PVC to cluster default StorageClass #1455

Merged
njhensley merged 2 commits into NVIDIA:main from njhensley:fix/uat-prometheus-storageclass
Jun 24, 2026

@njhensley

Member

Summary

Fix the AWS UAT config so the Prometheus PVC binds to the cluster's default StorageClass instead of a hardcoded gp3 name that doesn't exist on an aicr-bundle-deployed EKS cluster.

Motivation / Context

On the AWS UAT cluster, deployment readiness checks blocked indefinitely: the Prometheus StatefulSet was never created. The prometheus-operator was healthy but refused to reconcile with:

ReconciliationFailed: storage class "gp3" does not exist
→ StatefulSetNotFound (DESIRED 1, READY 0)

spec.bundle.scheduling.storageClass was pinned to the literal gp3. An aicr-bundle-deployed EKS cluster has no StorageClass by that name: the aws-ebs-csi-driver component provisions a gp3-backed default named ebs-csi-default-sc (annotated storageclass.kubernetes.io/is-default-class=true). A class literally named gp3 exists only on clusters that use the AWS managed EBS CSI addon. The stale comment ("Matches the UAT cluster") was incorrect.
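The mismatch described above can be sketched in Go. This is a minimal illustration, not code from the repository: the storageClass type and the findByName and findDefault helpers are hypothetical, and the class list contains only the class named in this PR. Only the annotation key is the real, well-known Kubernetes one. The sketch returns the first class annotated as default, which simplifies how Kubernetes resolves multiple defaults.

```go
package main

import "fmt"

// Well-known Kubernetes annotation that marks a StorageClass as the cluster default.
const defaultClassAnnotation = "storageclass.kubernetes.io/is-default-class"

// storageClass is a hypothetical, trimmed-down stand-in for the Kubernetes object.
type storageClass struct {
	Name        string
	Annotations map[string]string
}

// findByName reports whether a class with the given name exists.
func findByName(classes []storageClass, name string) bool {
	for _, sc := range classes {
		if sc.Name == name {
			return true
		}
	}
	return false
}

// findDefault returns the name of the first class annotated as default.
func findDefault(classes []storageClass) (string, bool) {
	for _, sc := range classes {
		if sc.Annotations[defaultClassAnnotation] == "true" {
			return sc.Name, true
		}
	}
	return "", false
}

func main() {
	// Illustrative data only: the single class this PR says the
	// aws-ebs-csi-driver component provisions.
	classes := []storageClass{
		{
			Name:        "ebs-csi-default-sc",
			Annotations: map[string]string{defaultClassAnnotation: "true"},
		},
	}

	// A PVC pinned to "gp3" has nothing to bind to on such a cluster.
	fmt.Println(`"gp3" exists:`, findByName(classes, "gp3"))

	// A PVC with no storageClassName falls back to the annotated default.
	if name, ok := findDefault(classes); ok {
		fmt.Println("default class:", name)
	}
}
```

Looking the class up by annotation rather than by name is what makes the config independent of how the EBS CSI driver was installed.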

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Other: tests/uat (AWS UAT test config)

Implementation Notes

Set storageClass: "". The injection path only writes a value when it is non-empty (pkg/cli/bundle.go, pkg/bundler/bundler.go), so an empty value is a true no-op. The EKS overlay (recipes/overlays/eks.yaml) already defines the Prometheus volumeClaimTemplate (50Gi, RWO, emptyDir: null) without a storageClassName, so the PVC binds to the cluster default StorageClass. Persistent storage is preserved (not dropped to emptyDir) and the config no longer depends on how EBS CSI was installed.
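A minimal Go sketch of the "write only when non-empty" behavior follows. It is an assumption about the shape of the logic, not the code in the files cited above: applyStorageClass and the map-based claim spec are hypothetical names chosen for illustration. Skipping the write matters because Kubernetes treats an unset storageClassName (use the cluster default) differently from one explicitly set to the empty string (no dynamic provisioning).

```go
package main

import "fmt"

// applyStorageClass is a hypothetical stand-in for the bundler's injection step.
// It writes storageClassName only when a non-empty value is configured.
func applyStorageClass(claimSpec map[string]any, storageClass string) {
	if storageClass == "" {
		// No-op: leave storageClassName unset so the cluster default applies.
		return
	}
	claimSpec["storageClassName"] = storageClass
}

func main() {
	// Shape loosely mirrors the volumeClaimTemplate described above (50Gi, RWO).
	spec := map[string]any{
		"accessModes": []string{"ReadWriteOnce"},
		"resources": map[string]any{
			"requests": map[string]any{"storage": "50Gi"},
		},
	}

	applyStorageClass(spec, "") // the value the UAT config now uses
	_, set := spec["storageClassName"]
	fmt.Println("storageClassName set:", set)
}
```

With a non-empty value such as "gp3" the same call would add the key, which is the behavior that left the PVC pointing at a class the cluster did not have.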

GCP UAT config is unchanged — GKE genuinely ships a class named premium-rwo.

Testing

# Verified on the live AWS UAT cluster: created the missing default-equivalent SC,
# operator reconciled, prometheus-kube-prometheus-prometheus StatefulSet 1/1 Running,
# PVC bound. The config change makes this durable across cluster recreates.

Risk Assessment

  • Low — Isolated change to a single UAT test config value, easy to revert.

Rollout notes: N/A — test harness config only; no production code path.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@njhensley njhensley requested a review from a team as a code owner June 24, 2026 20:11
@njhensley njhensley added the theme/validation Constraint evaluation, health checks, and conformance evidence label Jun 24, 2026
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
📥 Commits

Reviewing files that changed from the base of the PR and between db3ad73 and fe7890d.

📒 Files selected for processing (1)
  • tests/uat/aws/tests/h100-training-config.yaml

📝 Walkthrough

The H100 AWS UAT training config changes storageClass from gp3 to an empty value. Comments were added to note that an empty storageClass uses the cluster’s default StorageClass created by the AWS EBS CSI driver.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: binding the Prometheus PVC to the cluster default StorageClass. |
| Description check | ✅ Passed | The description is directly related to the change and explains the AWS UAT StorageClass fix and its motivation. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@njhensley njhensley enabled auto-merge (squash) June 24, 2026 20:34
@njhensley njhensley merged commit a0f2d09 into NVIDIA:main Jun 24, 2026
30 checks passed
@njhensley njhensley deleted the fix/uat-prometheus-storageclass branch June 24, 2026 21:20
Labels

  • area/tests
  • size/S
  • theme/validation (Constraint evaluation, health checks, and conformance evidence)