Skip to content

chore (validator): pin Trainer self-install JobSet image to promoted registry#1440

Merged
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/nccl-jobset-staging-1430
Jun 24, 2026
Merged

chore (validator): pin Trainer self-install JobSet image to promoted registry#1440
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/nccl-jobset-staging-1430

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Pin the JobSet controller image used by the NCCL performance validator's Kubeflow Trainer self-install to the promoted production registry (registry.k8s.io/jobset/jobset), aligning that path with AICR's kubeflow-trainer Helm chart and making the performance phase resilient on clusters that lack a pre-installed Trainer.

Motivation / Context

The performance validator can install Kubeflow Trainer itself when the cluster lacks it — an undocumented, ephemeral test fixture in validators/performance/trainer_lifecycle.go (install-if-absent → run NCCL test → tear down; gated only by a CRD existence check, no flag). That self-install builds the upstream Trainer v2.2.0 GitHub kustomize manifests, which pin the JobSet controller image to the Kubernetes staging registry us-central1-docker.pkg.dev/k8s-staging-images/jobset/jobset:v0.11.0. Staging tags are garbage-collected, so jobset-controller-manager enters ImagePullBackOff, its admission webhook (mjobset.kb.io) has no endpoints, and the NCCL TrainJob cannot start — the performance phase fails after a long timeout.

This is not a regression and is confined to one path. The supported, documented path — installing Trainer via AICR's kubeflow-trainer Helm component (oci://ghcr.io/kubeflow/charts/kubeflow-trainer:2.2.0) — already renders the promoted registry.k8s.io/jobset/jobset:v0.11.0 and is unaffected. The defect is the inconsistency: the in-repo self-install fixture references an ephemeral registry that bit-rotted. This change hardens that fixture to match.

Fixes: #1430

Type of Change

  • New feature / enhancement (hardening of the validator self-install fixture; aligns it with the Helm path)

Note: NVIDIA/aicr has no enhancement/bug repo labels; the conventional-commit type (feat) carries the classification.

Component(s) Affected

  • Validator (pkg/validatorvalidators/performance)

Implementation Notes

  • installTrainer passes each built manifest through a new rewriteJobSetStagingImage helper before applying — swaps the staging repo prefix for the promoted production repo.
  • Repo-prefix swap only: tag-agnostic (works for any future v0.x tag) and a no-op when the staging repo is absent (e.g. an upstream-fixed manifest).
  • Scoped strictly to the self-install path. The Helm/bundle path does not call this code, runs in a different process, and renders the promoted image already — zero behavioral impact there.

Testing

Unit:

go test ./validators/performance/ -run TestRewriteJobSetStagingImage -v   # PASS
golangci-lint run -c .golangci.yaml ./validators/performance/...          # 0 issues

Live A/B on GKE aicr-demo5 (H100, COS), clean-cluster self-install path, same recipe (gke/h100/cos/training/kubeflow), only the validator image differs:

Unfixed (:edge) Fixed (this PR)
JobSet image deployed …/k8s-staging-images/jobset/jobset:v0.11.0 registry.k8s.io/jobset/jobset:v0.11.0
jobset-controller-manager ImagePullBackOff (NotFound) Running 1/1
TrainJob never created (webhook no endpoints) launched: launcher + 2 workers on GPU nodes
Verdict fails passed — 338.22 GB/s (threshold 225, >= 250 → true), phase 2m34s

Validator emitted Rewrote JobSet image off staging registry from=…/k8s-staging-images/… to=registry.k8s.io/jobset/jobset. Result matches the issue's manual repro (~338 GB/s after hand-patching), confirming the image path was the sole defect.

Risk Assessment

  • Low — Isolated change on the validator's self-install path, covered by unit tests and a live A/B, trivially revertible. No effect on the bundle/Helm path.

Rollout notes: N/A — no API, recipe, or user-facing surface change.

Follow-ups (out of scope, noted on #1430)

  • deleteTrainer doesn't clean up cluster-scoped resources (ClusterRoles/Bindings, webhook configs), which then collide with a later Helm install of the kubeflow-trainer component.
  • Design question: whether the self-install fixture should exist at all vs. failing fast when Trainer is absent (consistent with the deployment-phase recommendation), and/or hardening isTrainerInstalled to check controller readiness.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed (N/A — internal validator fixture)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

A new helper rewriteJobSetStagingImage is added to validators/performance/trainer_lifecycle.go. It uses bytes.ReplaceAll to substitute the garbage-collected JobSet staging registry prefix (us-central1-docker.pkg.dev/k8s-staging-images/jobset/jobset) with the promoted registry prefix (registry.k8s.io/jobset/jobset), logging when a substitution occurs. Two repository-prefix constants are declared, the bytes package is imported, and the helper is invoked inside installTrainer on each rendered manifest YAML before it is unmarshalled into unstructured.Unstructured. A corresponding test file adds table-driven and tag-preservation tests.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the bug, root cause, solution, testing, and risk assessment for the JobSet image registry rewrite.
Linked Issues check ✅ Passed The PR fully addresses the requirements in #1430: it rewrites the JobSet image from staging to promoted registry, implements a tag-agnostic solution, adds unit test coverage, and includes live A/B testing confirming the fix resolves the performance validator failure on clean clusters.
Out of Scope Changes check ✅ Passed All changes are scoped to the Trainer self-install path in trainer_lifecycle.go and its test file. The rewriteJobSetStagingImage helper and its tests are directly related to addressing the linked issue with no unrelated or extraneous modifications.
Title check ✅ Passed The title accurately describes the main change: pinning the Trainer self-install JobSet image to a promoted production registry instead of a garbage-collected staging registry.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/performance/trainer_lifecycle_test.go`:
- Around line 70-78: The test suite for the rewriteJobSetStagingImage function
currently only covers tag preservation with
TestRewriteJobSetStagingImage_PreservesTag, but the implementation contract
indicates that both tags and digests should be preserved. Add a new test
function (e.g., TestRewriteJobSetStagingImage_PreservesDigest) that verifies the
rewriteJobSetStagingImage function correctly preserves image digests in the
`@sha256`:... format when rewriting the staging image repository prefix, following
the same pattern as the existing tag preservation test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 28a169ec-7a2c-4e43-ad52-3b1008ba16f2

📥 Commits

Reviewing files that changed from the base of the PR and between 58721c5 and 9f3bc31.

📒 Files selected for processing (2)
  • validators/performance/trainer_lifecycle.go
  • validators/performance/trainer_lifecycle_test.go

Comment on lines +70 to +78
// TestRewriteJobSetStagingImage_PreservesTag verifies the rewrite is a repo-prefix swap
// that preserves the original tag.
func TestRewriteJobSetStagingImage_PreservesTag(t *testing.T) {
in := "image: " + jobSetStagingImageRepo + ":v0.11.0"
want := "image: " + jobSetPromotedImageRepo + ":v0.11.0"
if got := string(rewriteJobSetStagingImage([]byte(in))); got != want {
t.Errorf("got %q, want %q", got, want)
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a digest-preservation case to match the helper contract.

The implementation comment says tag/digest are preserved, but tests only cover tags. Add one @sha256:... input case to lock this behavior.

Proposed test addition
 func TestRewriteJobSetStagingImage(t *testing.T) {
 	tests := []struct {
 		name        string
 		input       string
 		wantChanged bool
 	}{
@@
 		{
+			name:        "staging image with digest is repointed and digest preserved",
+			input:       "image: us-central1-docker.pkg.dev/k8s-staging-images/jobset/jobset@sha256:0123456789abcdef",
+			wantChanged: true,
+		},
+		{
 			name:        "already-promoted image is left untouched",
 			input:       "image: registry.k8s.io/jobset/jobset:v0.11.0",
 			wantChanged: false,
 		},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/performance/trainer_lifecycle_test.go` around lines 70 - 78, The
test suite for the rewriteJobSetStagingImage function currently only covers tag
preservation with TestRewriteJobSetStagingImage_PreservesTag, but the
implementation contract indicates that both tags and digests should be
preserved. Add a new test function (e.g.,
TestRewriteJobSetStagingImage_PreservesDigest) that verifies the
rewriteJobSetStagingImage function correctly preserves image digests in the
`@sha256`:... format when rewriting the staging image repository prefix, following
the same pattern as the existing tag preservation test.

…gistry

The NCCL performance validator can install Kubeflow Trainer itself when the
cluster lacks it — an undocumented, ephemeral test fixture in
validators/performance/trainer_lifecycle.go (install-if-absent, run, tear
down). That self-install builds the upstream Trainer v2.2.0 GitHub kustomize
manifests, which pin the JobSet controller image to the Kubernetes staging
registry (us-central1-docker.pkg.dev/k8s-staging-images/jobset/jobset:v0.11.0).
Those staging tags are garbage-collected, so jobset-controller-manager enters
ImagePullBackOff, its admission webhook has no endpoints, and the NCCL TrainJob
cannot start.

Rewrite the JobSet image repository onto the promoted production registry
(registry.k8s.io/jobset/jobset, same tag) before applying the manifests. This
aligns the self-install fixture with AICR's kubeflow-trainer Helm chart, which
already deploys the promoted image. The swap is a repo-prefix replacement, so
it is tag-agnostic and a no-op when the staging repo is absent.

Fixes NVIDIA#1430
@yuanchen8911 yuanchen8911 force-pushed the fix/nccl-jobset-staging-1430 branch from 9f3bc31 to 2cf2998 Compare June 23, 2026 23:54
@yuanchen8911 yuanchen8911 changed the title fix(validator): repoint JobSet image off GC'd staging registry feat(validator): pin Trainer self-install JobSet image to promoted registry Jun 23, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 23, 2026 23:56
@yuanchen8911 yuanchen8911 requested a review from mchmarny June 23, 2026 23:56
@yuanchen8911 yuanchen8911 changed the title feat(validator): pin Trainer self-install JobSet image to promoted registry chore (validator): pin Trainer self-install JobSet image to promoted registry Jun 24, 2026
@mchmarny mchmarny enabled auto-merge (squash) June 24, 2026 12:03
@mchmarny mchmarny merged commit e927f39 into NVIDIA:main Jun 24, 2026
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Harden NCCL perf validator: pin Trainer self-install JobSet image to promoted registry

2 participants