chore (validator): pin Trainer self-install JobSet image to promoted registry#1440
Conversation
📝 WalkthroughWalkthroughA new helper Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/performance/trainer_lifecycle_test.go`:
- Around line 70-78: The test suite for the rewriteJobSetStagingImage function
currently only covers tag preservation with
TestRewriteJobSetStagingImage_PreservesTag, but the implementation contract
indicates that both tags and digests should be preserved. Add a new test
function (e.g., TestRewriteJobSetStagingImage_PreservesDigest) that verifies the
rewriteJobSetStagingImage function correctly preserves image digests in the
`@sha256`:... format when rewriting the staging image repository prefix, following
the same pattern as the existing tag preservation test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 28a169ec-7a2c-4e43-ad52-3b1008ba16f2
📒 Files selected for processing (2)
validators/performance/trainer_lifecycle.govalidators/performance/trainer_lifecycle_test.go
| // TestRewriteJobSetStagingImage_PreservesTag verifies the rewrite is a repo-prefix swap | ||
| // that preserves the original tag. | ||
| func TestRewriteJobSetStagingImage_PreservesTag(t *testing.T) { | ||
| in := "image: " + jobSetStagingImageRepo + ":v0.11.0" | ||
| want := "image: " + jobSetPromotedImageRepo + ":v0.11.0" | ||
| if got := string(rewriteJobSetStagingImage([]byte(in))); got != want { | ||
| t.Errorf("got %q, want %q", got, want) | ||
| } | ||
| } |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Add a digest-preservation case to match the helper contract.
The implementation comment says tag/digest are preserved, but tests only cover tags. Add one @sha256:... input case to lock this behavior.
Proposed test addition
func TestRewriteJobSetStagingImage(t *testing.T) {
tests := []struct {
name string
input string
wantChanged bool
}{
@@
{
+ name: "staging image with digest is repointed and digest preserved",
+ input: "image: us-central1-docker.pkg.dev/k8s-staging-images/jobset/jobset@sha256:0123456789abcdef",
+ wantChanged: true,
+ },
+ {
name: "already-promoted image is left untouched",
input: "image: registry.k8s.io/jobset/jobset:v0.11.0",
wantChanged: false,
},🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@validators/performance/trainer_lifecycle_test.go` around lines 70 - 78, The
test suite for the rewriteJobSetStagingImage function currently only covers tag
preservation with TestRewriteJobSetStagingImage_PreservesTag, but the
implementation contract indicates that both tags and digests should be
preserved. Add a new test function (e.g.,
TestRewriteJobSetStagingImage_PreservesDigest) that verifies the
rewriteJobSetStagingImage function correctly preserves image digests in the
`@sha256`:... format when rewriting the staging image repository prefix, following
the same pattern as the existing tag preservation test.
…gistry The NCCL performance validator can install Kubeflow Trainer itself when the cluster lacks it — an undocumented, ephemeral test fixture in validators/performance/trainer_lifecycle.go (install-if-absent, run, tear down). That self-install builds the upstream Trainer v2.2.0 GitHub kustomize manifests, which pin the JobSet controller image to the Kubernetes staging registry (us-central1-docker.pkg.dev/k8s-staging-images/jobset/jobset:v0.11.0). Those staging tags are garbage-collected, so jobset-controller-manager enters ImagePullBackOff, its admission webhook has no endpoints, and the NCCL TrainJob cannot start. Rewrite the JobSet image repository onto the promoted production registry (registry.k8s.io/jobset/jobset, same tag) before applying the manifests. This aligns the self-install fixture with AICR's kubeflow-trainer Helm chart, which already deploys the promoted image. The swap is a repo-prefix replacement, so it is tag-agnostic and a no-op when the staging repo is absent. Fixes NVIDIA#1430
9f3bc31 to
2cf2998
Compare
Summary
Pin the JobSet controller image used by the NCCL performance validator's Kubeflow Trainer self-install to the promoted production registry (
registry.k8s.io/jobset/jobset), aligning that path with AICR'skubeflow-trainerHelm chart and making the performance phase resilient on clusters that lack a pre-installed Trainer.Motivation / Context
The performance validator can install Kubeflow Trainer itself when the cluster lacks it — an undocumented, ephemeral test fixture in
validators/performance/trainer_lifecycle.go(install-if-absent → run NCCL test → tear down; gated only by a CRD existence check, no flag). That self-install builds the upstream Trainer v2.2.0 GitHub kustomize manifests, which pin the JobSet controller image to the Kubernetes staging registryus-central1-docker.pkg.dev/k8s-staging-images/jobset/jobset:v0.11.0. Staging tags are garbage-collected, sojobset-controller-managerentersImagePullBackOff, its admission webhook (mjobset.kb.io) has no endpoints, and the NCCL TrainJob cannot start — the performance phase fails after a long timeout.This is not a regression and is confined to one path. The supported, documented path — installing Trainer via AICR's
kubeflow-trainerHelm component (oci://ghcr.io/kubeflow/charts/kubeflow-trainer:2.2.0) — already renders the promotedregistry.k8s.io/jobset/jobset:v0.11.0and is unaffected. The defect is the inconsistency: the in-repo self-install fixture references an ephemeral registry that bit-rotted. This change hardens that fixture to match.Fixes: #1430
Type of Change
Component(s) Affected
pkg/validator—validators/performance)Implementation Notes
installTrainerpasses each built manifest through a newrewriteJobSetStagingImagehelper before applying — swaps the staging repo prefix for the promoted production repo.Testing
Unit:
Live A/B on GKE
aicr-demo5(H100, COS), clean-cluster self-install path, same recipe (gke/h100/cos/training/kubeflow), only the validator image differs::edge)…/k8s-staging-images/jobset/jobset:v0.11.0registry.k8s.io/jobset/jobset:v0.11.0jobset-controller-managerImagePullBackOff(NotFound)Running 1/1passed— 338.22 GB/s (threshold 225,>= 250 → true), phase 2m34sValidator emitted
Rewrote JobSet image off staging registry from=…/k8s-staging-images/… to=registry.k8s.io/jobset/jobset. Result matches the issue's manual repro (~338 GB/s after hand-patching), confirming the image path was the sole defect.Risk Assessment
Rollout notes: N/A — no API, recipe, or user-facing surface change.
Follow-ups (out of scope, noted on #1430)
deleteTrainerdoesn't clean up cluster-scoped resources (ClusterRoles/Bindings, webhook configs), which then collide with a later Helm install of thekubeflow-trainercomponent.isTrainerInstalledto check controller readiness.Checklist
make testwith-race)make lint)git commit -S)