refactor(recipes): retire gb200-any-training.yaml wildcard (#1052)#1068
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR deletes the intent-scoped gb200-any-training overlay, updates recipes/overlays/gb200-any.yaml comments to document the retirement and the accelerator-scoped deployment-floor, expands TestOverlayValidationPhaseFloor comments about knownGaps and pending OCI performance gaps, and updates design/integrator/contributor docs and worked examples/diagrams to reference gb200-any and deployment-phase checks instead of the retired training wildcard. Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
2986a18 to
5572563
Compare
5009c31 to
76de5af
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/contributor/data.md`:
- Line 490: Replace the awkward phrase "wants kept" in the wildcard override
note with clearer wording such as "wants to keep" (or "to be kept") so the
sentence reads: "a service-specific leaf must restate every `check` it wants to
keep AND every constraint..." — edit the sentence containing "wants kept" in the
docs content to use the improved phrasing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 9ea4cec5-8326-40be-84f6-e432fee63511
📒 Files selected for processing (6)
docs/contributor/data.mddocs/design/005-overlay-refactoring.mddocs/integrator/recipe-development.mdpkg/recipe/validation_phase_floor_test.gorecipes/overlays/gb200-any-training.yamlrecipes/overlays/gb200-any.yaml
💤 Files with no reviewable changes (1)
- recipes/overlays/gb200-any-training.yaml
The criteria-wildcard overlay `gb200-any-training.yaml` carried a single `nccl-all-reduce-bw >= 720` constraint applied to every GB200 + training query regardless of service. The companion `b200-any-training.yaml` had the same shape. The pattern is misleading: each service has a different network fabric (EFA on EKS, TCPXO on GKE, RoCE on OKE) and a single cross-service threshold is rarely correct for two fabrics. NCCL bandwidth thresholds belong on the concrete service-bound leaf, anchored to a measurement on that specific fabric. The fabric-independent deployment-phase floor (4 standard health checks + `Deployment.gpu-operator.version` pin) remains on `gb200-any.yaml` — that pattern (established by NVIDIA#1001) is correct because the same gpu-operator version requirement applies across every service for the accelerator. Changes: - Delete `recipes/overlays/gb200-any-training.yaml`. - Update `recipes/overlays/gb200-any.yaml` comment block: drop the "Companion to gb200-any-training.yaml" intro, explain why the intent-scoped sibling was retired. - Doc updates with the same rationale: - `docs/integrator/recipe-development.md` — switch the criteria-wildcard example from `gb200-any-training.yaml` to `gb200-any.yaml` (deployment-phase floor); add the "per-fabric values don't belong here" caveat. - `docs/contributor/data.md` — refresh the wildcard explanation section, the resolver-tracing example, and the Mermaid flowchart to use `gb200-any` throughout; rename `-any-` naming convention to allow `-any` (deployment-floor pattern). - `docs/design/005-overlay-refactoring.md` — drop the `b200-any-training` / `gb200-any-training` lines from the overlay tree and leave a brief historical note explaining the retirement across PRs NVIDIA#1004 (b200 wildcard) and NVIDIA#1052 (gb200 wildcard). - Update the `knownGaps` header comment in `pkg/recipe/validation_phase_floor_test.go` to document the OKE GB200 training performance data gap (warn-only in non-strict mode today; closes when NVIDIA#1007 lands an OCI testbed measurement). The map stays empty: the floor test treats missing performance as a `t.Log(WARN)` in default mode, so a knownGaps entry would be stale-flagged. The `b200-any-training.yaml` deletion is in flight via NVIDIA#1004 (PR NVIDIA#1053); the two issues are independent. Acceptance (per NVIDIA#1052): 1. `recipes/overlays/gb200-any-training.yaml` removed: yes. 2. `gb200-oke-training` perf coverage: option (b) — deferred to NVIDIA#1007 for real OCI measurements; covered as a warn-only floor gap today. 3. `go test ./pkg/recipe/... -run TestOverlayValidationPhaseFloor`: passes. 4. `aicr recipe --service oke --accelerator gb200 --intent training --format yaml`: resolves to 11 components, 6 overlays (base, monitoring-hpa, gb200-any, oke, oke-training, gb200-oke-training); carries the gpu-operator v25.10.0 floor and the standard 4 deployment checks via `gb200-any.yaml`. 5. Doc references updated. 6. `make qualify` clean. Fixes NVIDIA#1052 Related NVIDIA#1004, NVIDIA#1007
76de5af to
5de37c3
Compare
Summary
Retire the
gb200-any-training.yamlcriteria-wildcard overlay. NCCL bandwidth thresholds (the only content the wildcard carried) belong per-leaf, not on a cross-service wildcard — each service has a different network fabric (EFA on EKS, TCPXO on GKE, RoCE on OKE) and the same number is rarely correct for two fabrics.The fabric-independent deployment-phase floor (gpu-operator version pin + 4 standard health checks) remains on
gb200-any.yaml— that pattern (established by #1001) is correct because the version requirement applies across every service.Motivation / Context
Per #1052: the wildcard pattern was useful before any service-specific GB200 leaves existed but is misleading now. EKS GB200 training leaves carry per-leaf NCCL thresholds; the wildcard was load-bearing only for the OKE GB200 leaves (which had no perf block of their own). After the OCI testbed work in #1007 lands,
gb200-oke-training.yamlwill ship its own measurement-anchored constraint and the warn-only data gap closes.The companion
b200-any-training.yamlwas a stub — no concrete B200 service-bound leaves existed yet — and is retired by #1004 (PR #1053) as a natural consequence of adding the first concrete B200 leaf.Fixes: #1052
Related: #1004 (PR #1053 — retires the B200 wildcard), #1007 (OCI testbed; supplies the OKE GB200 NCCL measurement that closes the warn-only gap)
Type of Change
gb200-oke-*training leaves lose their inherited NCCL constraint, which is currently a warn-only floor recommendation in non-strict mode)Component(s) Affected
pkg/recipe) — floor-test header refresh only; no code changedocs/) — three pages updated to reflect the retirementImplementation Notes
Deletion.
recipes/overlays/gb200-any-training.yamlremoved. It carried onlyvalidation.performance.checks/constraintsfornccl-all-reduce-bw >= 720.OKE GB200 training data gap.
gb200-oke-training,gb200-oke-ubuntu-training, andgb200-oke-ubuntu-training-kubeflowno longer inherit a performance block. Took option (b) from the issue: defer the real OCI testbed measurement to #1007, leave the gap as a warn-only floor-test result in the meantime.TestOverlayValidationPhaseFloortreats missing performance ast.Log(WARN ...)in default (non-strict) mode, so aknownGapsentry would be stale-flagged by the hygiene guard — instead, theknownGapsheader comment inpkg/recipe/validation_phase_floor_test.godocuments the pending gap and notes that a strict-mode CI toggle would need three entries referencing #1007.gb200-any.yamlcomment refresh. Dropped the "Companion to gb200-any-training.yaml" intro; explains why the intent-scoped sibling was retired and points at #1052.Doc updates (all three pages the issue called out):
docs/integrator/recipe-development.md— switch the criteria-wildcard example fromgb200-any-training.yamltogb200-any.yaml(the accelerator-scoped deployment floor). Add the caveat that per-fabric values don't belong on wildcards.docs/contributor/data.md— refresh the wildcard explanation section, the resolver-tracing prose example, and the Mermaid flowchart to usegb200-anythroughout. Update the naming convention to mention-any(deployment-floor pattern) alongside-any-<intent>.docs/design/005-overlay-refactoring.md— removeb200-any-trainingandgb200-any-trainingfrom the overlay tree; brief historical note explains the retirement (cross-referencing feat(recipes): add concrete GKE B200 service-bound overlays #1004 and refactor(recipes): retire*-any-training.yamlwildcards (NCCL thresholds per-leaf) #1052).Out of scope.
b200-any-training.yamldeletion: handled by PR feat(recipes): add concrete GKE B200 service-bound overlays #1053 (issue feat(recipes): add concrete GKE B200 service-bound overlays #1004). Independent of this PR per the coordination note in refactor(recipes): retire*-any-training.yamlwildcards (NCCL thresholds per-leaf) #1052.a100/l40: those accelerators have no overlays at all today (tracked in feat(recipes): add concrete A100 service-bound overlays #1002 / feat(recipes): add concrete L40 / L40S service-bound overlays #1003).gb200-oke-trainingperf measurements: blocked on feat(validation): add performance-phase constraints to OKE overlays once OCI testbed lands #1007 OCI testbed.Testing
Risk Assessment
intent: training). Trivial to revert (one file restore + comment rollbacks).Rollout notes: No flag, no opt-in. The next
aicr recipe --service oke --accelerator gb200 --intent traininghydration produces a recipe without the cross-service NCCL constraint; everything else is identical. When #1007 closes with real OCI numbers,gb200-oke-training.yamlships its own performance block and the warn drains.Checklist
make testwith-race) — viamake qualifymake lint) — viamake qualifygit commit -S)